HaloD — Dynamic Load Balancing for Orbited
July 10th, 2007 by Michael CarterHaloD is a daemon that performs dynamic load scaling for an orbited cluster.
Before you get too excited, note that this is a proposed project. It is in the planning stage. No code exists yet, and I am not going to tackle the project until I feel that Orbited is 100% ready for production use. (What’s the point of scaling Orbited if you can’t use it?) I welcome any contributions or developers who wish to join the project. This isn’t for the faint of heart though. Implementing HaloD requires writing low-level C code that interfaces directly with the TCP/IP stack in the kernel via the Netfilter project. Another side-effect of this is that HaloD will be platform specific: Linux. This is acceptable to me though because most web apps that use open source software tend to be running on Linux anyway.
With that said, I’ll get on with it.
Name
The name HaloD comes from a combination of the terms “halo orbit” and “daemon”. It is pronounced like the english word “hallowed” but “halo dee” is also an acceptable pronunciation.
“A halo orbit is an orbit around a Lagrange point between two larger bodies. Because the orbit tends to be unstable stationkeeping is required to keep an object such as a satellite in this orbit” — Wikipedia.
How it Works
HaloD accepts incoming connections from the orbited nodes. Whenever an orbited node starts, if configured to do so, it immediately opens a connection with the HaloD server. The HaloD server decides to either 1. Keep the node in reserve, or 2. Add the node to the active cluster.
Browsers and Orbited clients connect to the cluster by connecting to the HaloD machine. For every port that HaloD exposes to browsers, there will be a corresponding port exposed to Orbited Clients (web apps).
The key to understanding HaloD is to think of a typical commodity router. It has a single ip exposed to the outside world, yet allows access to a multitude of intranet computers. It does this by designating certain ports to certain machines and keeping track of which ports map to which machines. So if machine1 makes a request to google.com, the router sends the request out on port 23432, and remembers that incoming traffic on port 23432 maps to a local port on machine1.
Similarly, HaloD is a custom router. It listens on a range of ports, and then maps them to the Orbited cluster nodes behind it. The reason this is helpful is that the end destination of a request to port 500, let’s say, might change, but the browser or web app is none the wiser. Hopefully the following diagram will help you visualize HaloD.

The diagram only shows how browsers connect to HaloD and are forwarded to Orbited nodes, but the same method is used for web applications connecting to HaloD.
Routing Algorithm
Assume for a moment that there is a max of N nodes; let N be 24. For the first node, no matter which of the 24 ports you connect to HaloD on (for either the browser or the client), the packets will be routed to the one (and only) orbited node. So there are 24 routes to a single orbited node. When a second node is added, 12 of the routes are moved over. They are now directed at the second node. When a third node is added, 1/3 of the routes are moved to the third node.
It wouldn’t do the choose any arbitrary routes to move upon adding a node, we need to do it systematically. The formula to determine which routes are moved is as follows:
Algorithm setup
Assign an id from 1 to N to each route.
Adding each new node (algorithm)
- The new node is given an id of M + 1 where M is the largest id of an existing node (if no nodes are present then M is 0)
- If no other nodes exist, all N routes are assigned to the new node. Finish.
- Sort the nodes by number of routes assigned (k), and secondly by node id
- Moving in descending order down the list, move assignment of the highest route id for the given node to the new node until floor(N / (M + 1)) routes have been reassigned.
- If this quota hasn’t been met but the list has been compeltely traversed, then start at the top and repeat the process (step 4)
- For all routes reassigned, send a ‘route moved’ message to each route’s former orbited node. The orbited node will issue a reconnect event to any users connected via that route if necessary. (It is necessary for the Iframe transport, but not for the XHR transport. For XHR just close all open connections and the browsers will re-connect.)
Progression of Routes as Nodes are added
The result of the above algorithm can most easily be understood by examining a picture. If we have 24 routes, the following diagram shows which routes are assigned to each of n nodes, for n between 1 and 8.

Failover
A major reason to use dynamic load balancing is to provide a failover mechanism. If one of your nodes fails, you don’t want to know about it right away. Rather, you want the entire system to continue functioning and you want as little downtime for any user as possible. Here is how we do this with HaloD:
If HaloD loses connection with an orbited node and is not able to immediately re-establish connection, then it reassigns all routes from that node to a failover node. This is the quickest and easiest way to keep service up at all times.
But sometimes you don’t have a failover node available. In this case, HaloD reassigns the node with the highest id (node M) to replace the failed node, then uses a reverse of the route reassignment process describe above to reassign all former node M routes to the appropriate orbited nodes given that we now have total nodes = N = M-1. Note that the client-side orbited implementation (javascript) needs to support automatic reconnection if the connection is lost. Because HaloD immediately recognizes an Orbited node failing, it can swap the routes in a matter of a few cpu cycles, so before the browsers even realize their Orbited Node is down, it’ll be back up.
Implementation Details
As I said before, this project requires per-packet processing, which is only possible at the kernel level, either directly or via the Netfilter project. That means that HaloD must be written entirely in low-level C. We also need to work at a level a bit higher than packet processing, as HaloD needs to accept incoming connections from the Orbited nodes for monitoring purposes. Assuming we want to put 1024 nodes in a single cluster, then we need to quickly process heartbeat signals from each of the nodes. This isn’t a lot of bandwith, but we want it to be snappy. I think the best solution here is to use either libevent or epoll directly. This should work well with our low latency, high concurrency requirements. And who knows, eventually someone may want to put 2048 nodes or more in a single cluster.
The routing table itself is just a hash table in memory. It may be a good idea to dump the routing table to a database or network device every time it’s updated. This allows the possibility of creating real-time failover solutions for the HaloD server itself.
Conclusion
I know this may seem an ambitious project at first glance, but the hardest part is testing it with a realistic setup. The routing algorithm is simple, as will be the heartbeat protocol. The project is necessary though to the future of real-time web applications. Maybe not HaloD in particular, but a similar solution will be needed, and sooner than you might imagine. I think that the hardest part will be attracting web-developers who understand the need for such a system, and are comfortable working in C. Most web developers today have little history with low-level languages — as it should be.

December 31st, 2007 at 5:13 pm
[...] time ago I proposed a project called HaloD with the intent of solving this problem. The basic idea is to pre-allocate a cluster as having an [...]
April 4th, 2008 at 7:23 pm
I couldn’t understand some parts of this article HaloD — Dynamic Load Balancing for Orbited, but I guess I just need to check some more resources regarding this, because it sounds interesting.