Author Archive

HaloD — Dynamic Load Balancing for Orbited

Tuesday, July 10th, 2007

HaloD is a daemon that performs dynamic load scaling for an orbited cluster.

Before you get too excited, note that this is a proposed project. It is in the planning stage. No code exists yet, and I am not going to tackle the project until I feel that Orbited is 100% ready for production use. (What’s the point of scaling Orbited if you can’t use it?) I welcome any contributions or developers who wish to join the project. This isn’t for the faint of heart though. Implementing HaloD requires writing low-level C code that interfaces directly with the TCP/IP stack in the kernel via the Netfilter project. Another side-effect of this is that HaloD will be platform specific: Linux. This is acceptable to me though because most web apps that use open source software tend to be running on Linux anyway.

With that said, I’ll get on with it.

Name

The name HaloD comes from a combination of the terms “halo orbit” and “daemon”. It is pronounced like the english word “hallowed” but “halo dee” is also an acceptable pronunciation.

“A halo orbit is an orbit around a Lagrange point between two larger bodies. Because the orbit tends to be unstable stationkeeping is required to keep an object such as a satellite in this orbit” — Wikipedia.

How it Works

HaloD accepts incoming connections from the orbited nodes. Whenever an orbited node starts, if configured to do so, it immediately opens a connection with the HaloD server. The HaloD server decides to either 1. Keep the node in reserve, or 2. Add the node to the active cluster.

Browsers and Orbited clients connect to the cluster by connecting to the HaloD machine. For every port that HaloD exposes to browsers, there will be a corresponding port exposed to Orbited Clients (web apps).

The key to understanding HaloD is to think of a typical commodity router. It has a single ip exposed to the outside world, yet allows access to a multitude of intranet computers. It does this by designating certain ports to certain machines and keeping track of which ports map to which machines. So if machine1 makes a request to google.com, the router sends the request out on port 23432, and remembers that incoming traffic on port 23432 maps to a local port on machine1.

Similarly, HaloD is a custom router. It listens on a range of ports, and then maps them to the Orbited cluster nodes behind it. The reason this is helpful is that the end destination of a request to port 500, let’s say, might change, but the browser or web app is none the wiser. Hopefully the following diagram will help you visualize HaloD.

HaloD Architecture

The diagram only shows how browsers connect to HaloD and are forwarded to Orbited nodes, but the same method is used for web applications connecting to HaloD.

Routing Algorithm

Assume for a moment that there is a max of N nodes; let N be 24. For the first node, no matter which of the 24 ports you connect to HaloD on (for either the browser or the client), the packets will be routed to the one (and only) orbited node. So there are 24 routes to a single orbited node. When a second node is added, 12 of the routes are moved over. They are now directed at the second node. When a third node is added, 1/3 of the routes are moved to the third node.

It wouldn’t do the choose any arbitrary routes to move upon adding a node, we need to do it systematically. The formula to determine which routes are moved is as follows:

Algorithm setup

Assign an id from 1 to N to each route.

Adding each new node (algorithm)

  1. The new node is given an id of M + 1 where M is the largest id of an existing node (if no nodes are present then M is 0)
  2. If no other nodes exist, all N routes are assigned to the new node. Finish.
  3. Sort the nodes by number of routes assigned (k), and secondly by node id
  4. Moving in descending order down the list, move assignment of the highest route id for the given node to the new node until floor(N / (M + 1)) routes have been reassigned.
  5. If this quota hasn’t been met but the list has been compeltely traversed, then start at the top and repeat the process (step 4)
  6. For all routes reassigned, send a ‘route moved’ message to each route’s former orbited node. The orbited node will issue a reconnect event to any users connected via that route if necessary. (It is necessary for the Iframe transport, but not for the XHR transport. For XHR just close all open connections and the browsers will re-connect.)

Progression of Routes as Nodes are added

The result of the above algorithm can most easily be understood by examining a picture. If we have 24 routes, the following diagram shows which routes are assigned to each of n nodes, for n between 1 and 8.

HaloD node routing algorithm

Failover

A major reason to use dynamic load balancing is to provide a failover mechanism. If one of your nodes fails, you don’t want to know about it right away. Rather, you want the entire system to continue functioning and you want as little downtime for any user as possible. Here is how we do this with HaloD:

If HaloD loses connection with an orbited node and is not able to immediately re-establish connection, then it reassigns all routes from that node to a failover node. This is the quickest and easiest way to keep service up at all times.

But sometimes you don’t have a failover node available. In this case, HaloD reassigns the node with the highest id (node M) to replace the failed node, then uses a reverse of the route reassignment process describe above to reassign all former node M routes to the appropriate orbited nodes given that we now have total nodes = N = M-1. Note that the client-side orbited implementation (javascript) needs to support automatic reconnection if the connection is lost. Because HaloD immediately recognizes an Orbited node failing, it can swap the routes in a matter of a few cpu cycles, so before the browsers even realize their Orbited Node is down, it’ll be back up.

Implementation Details

As I said before, this project requires per-packet processing, which is only possible at the kernel level, either directly or via the Netfilter project. That means that HaloD must be written entirely in low-level C. We also need to work at a level a bit higher than packet processing, as HaloD needs to accept incoming connections from the Orbited nodes for monitoring purposes. Assuming we want to put 1024 nodes in a single cluster, then we need to quickly process heartbeat signals from each of the nodes. This isn’t a lot of bandwith, but we want it to be snappy. I think the best solution here is to use either libevent or epoll directly. This should work well with our low latency, high concurrency requirements. And who knows, eventually someone may want to put 2048 nodes or more in a single cluster.

The routing table itself is just a hash table in memory. It may be a good idea to dump the routing table to a database or network device every time it’s updated. This allows the possibility of creating real-time failover solutions for the HaloD server itself.

Conclusion

I know this may seem an ambitious project at first glance, but the hardest part is testing it with a realistic setup. The routing algorithm is simple, as will be the heartbeat protocol. The project is necessary though to the future of real-time web applications. Maybe not HaloD in particular, but a similar solution will be needed, and sooner than you might imagine. I think that the hardest part will be attracting web-developers who understand the need for such a system, and are comfortable working in C. Most web developers today have little history with low-level languages — as it should be.

Orbited Paper Delayed

Wednesday, June 20th, 2007

Orbited Paper Delayed

I mentioned previously that I was working on a paper about Orbited. It includes pretty diagrams and succinct technical explanations and comparisons to other technologies using good, hard data.

Unfortunately, that paper is going to be delayed. I’ve got a lot of stuff going on, with the new job, and I’ve had to travel a fair amount recently for various reasons. More importantly though, my Co-author Professor Art Lee has and will be occupied for a various number of reasons including the purchase of a new house and his kids coming home from school for the summer.

It’s okay though — There aren’t many conferences with upcoming due dates that I’m interested in debuting Orbited to academia. I’ll keep you posted on news about the paper. I expect work to resume in late August or early September.

Call for Logos

Friday, June 15th, 2007

I’ve heard some complaints about the current Orbited logo. For one, it’s missing the “ed” in Orbit_ed_. Okay, I’m sorry — A friend put that together in just a few minutes. So if anyone out there wants to do a better job, I am all for that. I am low on time myself and I think I should spend the little time I do have strategically.

The Logo is often the first impression a prospective user has about a project. We want something that conveys a sense of community, but without looking too relaxed. We don’t want to scare away the Enterprisey folks right off the bat. So something that’s cute, but clearly professionally done. [Update, 2007-08-13: A new logo has been designed.]

Announcing Orbited 0.1.3

Tuesday, June 5th, 2007
  • New pluggable transport System: Uses setuptools; Write transports as seperate modules and it finds entry points at run time
  • Added iframe_alert: Not much of a point, but shows how you can control the browser remotely
  • Tutorial: using orbited with ruby
  • Tutorial: RailsChat (thanks to Jacob Rus!)
  • IRC Channel on freenode: #orbited
  • Demo: WebIRC (Thanks to Mario Balibrera!)

I would also like to note that I’ve begun work on an Orbited Paper, with my esteemed colleague Professor Art Lee who is an expert on distributed Java systems. He’s interested in getting into python and he’s also interested in comparing Orbited against the Java Cometd implementation.

Announcing Orbited 0.1.2

Thursday, May 3rd, 2007

Announcing Orbited 0.1.1

Tuesday, April 24th, 2007

Announcing Orbited 0.1.1

  • New Transport type: iframe
  • Rewrote Proxy
  • keepalive was messing it
  • Old method: read headers, move connection to proxy mode. Problematic due to keep alive
  • New method: read all data being proxied
  • Result: Keepalive works :-) but More Cpu :-(
  • New Demo: Cherrychat

The Failure of Cometd

Thursday, April 19th, 2007

Before and during the process of developing the concepts and architecture behind Orbited, I spent some time chatting with the folks in #cometd on irc.perl.org. I was a bit confused by the project for a while — I knew what problem domain they were attempting to solve, but I wasn’t sure how exactly they solved it. I was specifically confused on a number of points:

  1. Why does Cometd have a different version of the server for every language?
  2. How is the web-app supposed to send events to the browser?
  3. How do I communicate with the Cometd server from Javascript?
  4. How do I to run multiple Cometd nodes in a sensible way?

I’m going to give my best bet at answering each of these questions.

  1. Why does Cometd have a different version of the server for every language?

    Ultimately there will be one or two Cometd implementations to rule them all. And probably they’ll be written in Java. In the meantime, smart hackers are using whatever language they know best to experiment. There are multiple implementations because no one really knows what will work in the end.

  2. How is the web-app supposed to send events to the browser?

    The web application is supposed to pretend to be a browser that speaks Bayeux. So if the application wants to receive events, it will be using an iframe stream or long polling. If it wants to subscribe, unsubscribe, or publish, it will make json encoded Bayeux requests to the cometd server.

  3. How do I communicate with the Cometd server from JavaScript?

    The obvious answer is to use Bayeux. But it’s really hard to implement and you aren’t expected to do so. The guys at Dojo Toolkit have already implemented Bayeux, and presumably there are more to follow. So the short answer is, use Dojo Toolkit.

  4. How do I to run multiple Cometd nodes in a sensible way?

    I sort of have an answer, but I don’t fully understand it. I talked to some guys in #cometd on this issue, and the first response was “Good question… I don’t really know.” I did some research on the web and noticed this thread on a Cometd-user google group. The long and the short of it is “No one knows.” But on the other hand, Greg Wilkins told me that this is an easy problem to solve. You simply stick a Java Message Queue (JMS) behind your Cometd nodes and you’re done. I’m maybe a quarter willing to buy this, but in the face of this post by Greg, I have some doubts. I understand that Terracotta is completely unrelated to JMS, but it indicates that attempts at scaling aren’t going well. So the short answer is, design your app so that clusters of channels and users are isolated from each other such that each Cometd server supports a specific set of users/groups, and only those users and groups.

I hope my answers help someone who is interested in Cometd. Now I’m going to justify this post’s title. I have a number of problems with Cometd.

  1. No easy way for my web app to send an event to a specific user.
  2. You are locked into Dojo
  3. There is no possible method to pre-process events, short of rewriting the Cometd server
  4. There currently exists no security.
  5. Lacks App-based configurability (or any sort of configuration)
  6. Cometd is Inherently difficult to scale.

I’ll even substantiate my claims.

  1. No easy way for my web app to send an event to a specific user.

    Any server whose purpose is comet-like communication has two points of interaction for the end developer: The web application and the browser. These are the most important parts because they dictate how a server can or cannot be used. Furthermore, the end-developer only ever sees these two parts of the server. If they are complicated, it doesn’t matter how beautiful the internal implementation is. APIs that are complicated or hard to implement are generally not well received and take longer to be adopted.

    So let’s say that I want to use Cometd with a currently unsupported language for my web application… Lisp. I have to implement a Bayeux client in Lisp. Now, I’ve spent half a year thinking about the problem space of Comet, read the Bayeux spec through multiple times, conversed with the authors of Bayeux, and even closely analyzed the source code of the twisted Bayeux implementation, as well as the Dojo client implementation. I am still lost. I will not be implementing Bayeux in Lisp, or any other language for that matter. Sure, eventually someone will after Cometd is popular, but what do I do in the meantime? And what if I don’t like the implementation that exists for a given language? Nothing, I’m stuck. It’s not as god awful as being stuck in some proprietary system, but it is still terrible in its own right.

  2. Locked into Dojo

    The same logic applies for choice of Javascript toolkit. If you want to use Cometd, then you have to use Dojo Toolkit. Personally, I have little problem with that. But someone has a problem with it, otherwise everyone would already be using Dojo. And because it’s the Dojo guys who are pushing Bayeux, we aren’t likely to see them write an implementation for a competing framework.

  3. There is no possible method to pre-process events, short of rewriting the Cometd server

    Lets take a moment to examine the flow of events through a Cometd server. Here is my understanding of the Cometd stack:

    Cometd stack

    Lets say that user A sends a chat msg to channel 1, which contains user A, B, C, and the webapp. The event is sent from User A’s browser to Cometd. Cometd takes that message and delivers it to user B, C and the webapp. But lets also say that user C is in Chile where they don’t speak English. We want our application to translate the message on the fly from English to Spanish. Unfortunately, the Web App isn’t the hub, its just a client like all the browsers. It can receive notification of the message, but it can’t alter the message before it gets to the other users.

  4. There currently exists no security.

    There is no bar on who can join what channel, or even on who can publish to a given channel. In fact, check out the chat Cometd demo. Pay particular attention to line 51 and 52. This is the leave function, for leaving a chat room. Line 51 shows the browser unsubscribing from the chat channel, and line 52 shows the browser publishing a leave message to the channel. You don’t even have to be in a channel to publish to it. Is this by design, and if so what’s the advantage? I see this is a gaping security hole.

  5. Lacks App-based configurability (or any sort of configuration)

    Okay, let’s entertain the thought for a moment that Cometd is really intended to allow users to publish to channels they aren’t subscribed to. But now we want to make certain rooms private. How do we do this? Well, right now you can’t. But assume that it became a configuration option. How would the web app tell Cometd which rooms to make private or public? There are a zillion other settings that need to be created and modified on the fly by the web app, but that doesn’t exist in Cometd. There are no hooks for the app. Your only bet is to reprogram the Cometd daemon itself, which is not very realistic. How do you kick a user out of a channel? How do you moderate a channel? How do you set a channel size limit, and change it as the app runs? I could keep asking questions, but the answer to all of them is: You cannot.

  6. Cometd is Inherently difficult to scale.

    All of the previous problems are small beans though compared to my real problem: Cometd is inherently hard to scale. I mean, more so than Comet-style communication is inherently hard to scale. The reason is the publish subscribe mechanism that is the heart of Cometd. Every node in a Cometd cluster needs to contain information about every group. It needs to additionally know about every user and their membership in every group, or alternatively, relay every event to every server in the cluster. You’ll notice i used the word ‘every’ an awful lot just now, and that’s a problem. Every * every = n^2 of something, bandwidth or cpu, it doesn’t really matter. It’s difficult to scale an O(n^2) system. The cost of the second user is going to be a thousand times less than the cost of the thousandth user.

I honestly don’t understand why Cometd feels the need to tackle that problem. I’m not saying publish/subscribe isn’t ever necessary in a web app, rather that it’s a layer that doesn’t belong in a Comet server. Cometd shouldn’t be in the pubsub business… that’s a tough business. As if Comet wasn’t tough enough already.

In the end I think that Cometd is making amazing strides in advancing state of the art technology on the web, but it doesn’t mean we should be blind to alternatives.

Announcing PyOrbited 0.1

Sunday, April 15th, 2007

I’ve had a few versions of python orbited clients floating around, so I decided to package them together and make an egg. I plan on making tutorials, so this is a good step towards those ends. You can use setuptools easy_install to get the pyorbited package, or download it directly from [cheeseshop].

Announcing Pyorbited 0.1.0

Pyorbited is a collection of python orbited clients. They can be used in conjunction with Orbited (www.brbx.com/orbited) to enable real-time communication in your web apps.

The current implementations are:

  • Pyevent-orbited
  • Twisted-orbited
  • Python-orbited

Twisted + Orbited = Crazy Delicious

Friday, April 13th, 2007

There are all sorts of fun prototypes you can build using orbited and your favorite web framework. Unfortunately, most of these applications will never scale past a few hundred users on a single node. The reason is that your favorite web framework is probably pretty bad at network IO. It almost certainly dispatches each request to its own thread. And to add insult to injury, these applications probably won’t be very interesting.

Can your web framework create an IRC connection for every user logged into the system? Can it do it fast? Can it do it for more than 24 users? Well, guess what: Twisted can do all of that. For hundreds, if not thousands of users. It is event based, highly scalable, and has almost more protocol implementations than existing protocols.

Both Orbited and Twisted are event based. Both are targeted at low-latency and high scalability applications. Orbited is about receiving real-time notifications; Twisted is about connecting to real-time sources. If you want to connect X to orbited, where X is anything on the internet, then Twisted can do it and do it well.

To this end, I’ve written a twisted-orbited client (Update: This is now part of the pyorbited project) and done some rudimentary tests. I sense the beginning of a great relationship.

Share Nothing and Orbited

Monday, April 9th, 2007

I’ve been putting out a lot of propaganda about how scalable Orbited is. I think its time I explain why Orbited is scalable, and what the cost of that scalability is.

Let me first direct your attention to memcached, a brilliant distributed caching system from the livejournal guys. If you haven’t heard about it, go read and come back. The short description is that it acts like a distributed hash table. Each piece of data hashes to a specific server, so you always know where to find data, and memcached nodes don’t ever share information.

This is known as share nothing architecture. The purpose behind it is the cut down on intra-node dependance, and ultimately to allow scaling an application laterally. The classic version of this is web application sessions. If you store session data on each server node then as users access other nodes they need to contact the “home” node of that user to ask for session information. While this works, you end up with increasing overhead of each additional server. So adding the 10th node might not actually get you any performance increase because of the overhead in exchanging session state information with the other 9 nodes.

As far as Comet style communication goes, the main problem is maintaining state information about channels in the publish/subscribe paradigm. Cometd uses this pattern. Other examples include IRC and Jabber. The basic idea is that users can subscribe to channels, and publish to channels. Upon publishing an event to a channel, every member of that channel receives the message.

This is seductively trivial to implement on a single server. But what happens when you add the second server? You have to figure out some way to partition your user base. Partitioning by channel wouldn’t make much sense. Let’s say you had users subscribed to channel A connect to server A, and users subscribed to channel B connect to server B. What if all users need to connect to both channels? Then all users connect to both servers, and you’re back to square one.

Okay, then let’s partition based on users. Half the users will connect to server A, and the other half to server B. This works a bit better, but creates an additional problem. The crux of that obstacle is this: If you want to dispatch an event to channel A, to which server do you send that event? The answer, inevitably, is both servers. We’ve solved the public-side of our scaling problem nicely. I mean, the bandwidth coming from the public to us scales linearly. But our back-end suffers the consequences.

Long before we reach a respectably sized server cluster, our intranet will be completely saturated by our webapp dispatching events. On top of that, our webapp will have some serious cpu usage issues what with all the tcp connections and event dispatching to each of the Comet nodes. We could abstract a bit — use a central dispatching server, or put logic on each node in our Comet cluster such that each node will replicate an event to the other nodes. But no matter how you slice it you’ll end up with an unscalable solution, as event data needs to be re-exchanged an increasing amount with each additional node.

Disclaimer: I’m simplifying the explanation of scaling publish/subscribe to some extent. There are all sorts of graph theory experts who know great ways to solve some of these problems. So I’m not saying that we should throw out publish/subscribe. Rather, I’m saying that it’s damn hard to scale. I doubt we’ll see a truly scalable and open source comet server that provides a good publish/subscribe mechanism.

It’s time to see how Orbited fits in to this mess. You can think of Orbited the same way you would think of Memcached. Except, instead of distributing pieces of data across nodes, we’re distributing Comet connections. Each connection has a unique id, and based on that id the browser is assigned to a particular node. Each web app node uses the same hashing algorithm to dispatch events, so they always know where every user is without communicating with other nodes.

Orbited does not handle publish/subscribe whatsoever. There is no way for us to do that and still scale linearly. But as a result, every time we send an event from our application to a user, it takes a single path through our intranet to their browser.

So now the bad news. Most applications benefit from publish/subscribe. Even a simple stock ticker or chat application. And Orbited doesn’t provide that support. Our philosophy is that we aren’t experts at scaling publish/subscribe so we’ll let someone else do it. For example Jabber, or IRC. So if you want to use publish/subscribe, then you need to add an additional layer to your stack. This is going to be frustrating for newcomers who want to create a prototype and leave worrying about all the scaling rigamarole until later, or never. As a result, many applications, even my most basic demo app (UPDATE: this demo app has become cherrychat) have to roll their own subscribe based system.

I don’t want to end with the bad news, so I’m going to point out an academic paper from 2004 titled “Building Content-Based Publish/Subscribe Systems with Distributed Hash Tables” which discusses the future of scaling publish/subscribe on the internet. The authors propose that we use distributed hash tables as the basis of scalable publish/subscribe systems. So, rest assured that by adopting Orbited you are on the right track. It may not be a full stack solution, but it’s fast and scalable and sometimes those concerns trump all else.