Friday, September 11, 2009

The coming real-time web

It all started in 2001, as far as I can tell, in the early days of RSS. There was an optional part of the RSS spec - the <cloud> tag - which could do some neat stuff. It specified a protocol by which a subscriber to an RSS feed could also get near-instantaneous notifications whenever a feed updated, via a "cloud" server. Unfortunately, this part of RSS never really caught on, and without support from any feeds or software, it was basically dead in the water, and everybody kind of forgot it existed.

Fast forward to early summer of this year...

Sometime at the beginning of this summer, in May or June, Google announced their protocol for real-time notifications, called PubSubHubbub. (PubSub is shorthand for publish/subscribe, I guess. Also, you can acronym it to PuSH, and it uses push notifications. Neat!) PubSubHubbub is a really nice piece of design work - I can almost smell the scalability when I read it. (But more on the design issues later.) A bit later, in mid-July or so, Dave Winer (whose blog I follow) decided that it'd be neat to get the old RSS cloud running again. He started writing code and putting up test feeds, and people started showing up to the party.

Fast forward to today...

PubSubHubbub is live on LiveJournal and Blogger (including this blog!), while rssCloud is live on WordPress. The idea has been around for years, but suddenly, in the past month or two, real-time notifications for blogs have taken off in a big way. It's no fun if no aggregators support them, of course, which is why I implemented rssCloud support in my personal aggregator this week. (Yes, I wrote my own aggregator. All the other ones I tried out sucked. >_>) PubSubHubbub support is coming later, once I work out all the current bugs.

Design Issues

Here's a quick rundown of how RSS <cloud> and PubSubHubbub work. Unless you care deeply about how RSS and Atom feeds work, you can probably skip this section. :) I'm only covering the subscriber side of the protocols here, because that's all I really care about.

Broad overview: Both protocols are detectable in feeds, and both can work with either RSS or Atom feeds. When a subscriber detects a real-time protocol in a feed, it can notify (and periodically re-notify) the specified "hub" server that it wants to receive updates. Later, when an update happens, the hub server sends a notification to each subscriber.

In rssCloud, subscribers can detect the <cloud> tag in an RSS feed. PuSH, on the other hand, uses a standard <link> tag, with attribute rel="hub". They both contain the URL of a hub server that the client can contact. Here's the first major philosophical difference between the two: rssCloud allows several communication protocols (HTTP POST, XML-RPC, SOAP), while PuSH only uses HTTP POST. Standardizing on one protocol takes some flexibility away from implementors, but also simplifies things since they only have to worry about one protocol. Overall, I'd say this is a win for PuSH.
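To make the detection step concrete, here's a minimal sketch in Python of how a subscriber might look for either protocol in a feed. The function name, the sample feed, and the example.com hosts are all made up for illustration; it's a sketch using only the standard library, not the actual aggregator code.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical feed that advertises both protocols at once.
SAMPLE_RSS = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example</title>
    <cloud domain="cloud.example.com" port="80" path="/pleaseNotify"
           registerProcedure="" protocol="http-post" />
    <atom:link rel="hub" href="http://hub.example.com/" />
  </channel>
</rss>"""


def find_realtime_endpoints(feed_xml):
    """Return a dict describing any rssCloud or PuSH hub advertised in a feed."""
    root = ET.fromstring(feed_xml)
    found = {}

    # rssCloud: a <cloud> element inside the RSS <channel>.
    cloud = root.find("./channel/cloud")
    if cloud is not None:
        found["rsscloud"] = dict(cloud.attrib)

    # PuSH: an Atom <link rel="hub" href="..."> element, either in an Atom
    # feed or namespaced inside an RSS channel.
    for link in root.iter("{%s}link" % ATOM_NS):
        if link.get("rel") == "hub":
            found.setdefault("push_hubs", []).append(link.get("href"))

    return found


endpoints = find_realtime_endpoints(SAMPLE_RSS)
print(endpoints)
```

Note how the <cloud> tag has to carry the protocol choice along with the hub's address, while the PuSH link is just a URL.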

To subscribe, the client sends the hub a request saying that it would like to be notified of changes, using the method the feed specified. In PuSH, the client needs to specify a callback URL that it can be reached at. rssCloud, on the other hand, has the hub detect the client's address based on where the request came from. This is incredibly convenient for clients behind a NAT, since they have no good way of knowing their own address.
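Here's a rough sketch of what the two subscription requests might look like as form-encoded POST bodies. The parameter names follow my reading of the rssCloud HTTP POST interface and the PuSH spec, so double-check them against the specs before relying on them; the function names, URLs, and port are hypothetical. Nothing is sent over the network here, the code only builds the bodies.

```python
from urllib.parse import urlencode


def rsscloud_subscribe_body(feed_url, port, path):
    """Form body for an rssCloud subscription over HTTP POST.

    Only a port and path are given: the hub is expected to work out the
    client's host from the address the request came from.
    """
    return urlencode({
        "notifyProcedure": "",      # unused for the http-post protocol
        "port": str(port),
        "path": path,
        "protocol": "http-post",
        "url1": feed_url,           # further feeds would be url2, url3, ...
    })


def push_subscribe_body(feed_url, callback_url):
    """Form body for a PuSH subscription; the client names its own callback URL."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": feed_url,
        "hub.callback": callback_url,
        "hub.verify": "sync",
    })


print(rsscloud_subscribe_body("http://example.com/rss.xml", 8080, "/notify"))
print(push_subscribe_body("http://example.com/atom.xml",
                          "http://reader.example.net:8080/push"))
```

The difference in who supplies the address shows up directly: the rssCloud body never mentions a host, while the PuSH body has to include a full callback URL.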

PuSH also allows the client to specify a "secret", which the server can then use to send authenticated updates to clients. This brings us to what's probably the most fundamental difference between the two specs: In PuSH, updates from the hub are authoritative - they contain actual feed content that clients can use, bypassing the feed entirely. In rssCloud, on the other hand, notifications just tell clients to update the feed, so the notifications themselves don't have to be trusted. This greatly simplifies implementation of rssCloud, but it comes at a cost. Whenever a <cloud>-enabled RSS feed updates, it's going to get hit by every <cloud>-enabled client at the same time, when they're notified. This "flash crowd" is exactly the problem that Google was trying to solve by having the notifications contain authoritative feed content. It's a tradeoff, and it's hard to say which paradigm is better - static file hosting can be made to scale, after all. Luckily, both are in competition now, so time will tell.
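As a sketch of how the "secret" could be used on the subscriber side: as I understand the PuSH spec, the hub signs each delivery with an HMAC-SHA1 of the request body and sends it in an X-Hub-Signature header of the form sha1=<hex digest>. The function name and the sample secret and body below are hypothetical, and the header format is an assumption worth checking against the spec.

```python
import hashlib
import hmac


def signature_is_valid(secret, body, header_value):
    """Check a delivery body against the signature header sent by the hub.

    secret and body are bytes; header_value is the header string,
    assumed to look like 'sha1=<hex digest>'.
    """
    expected = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, header_value)


# Simulate what a hub would send for a given secret and body.
secret = b"not-a-real-secret"
body = b"<feed>...</feed>"
header = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()

print(signature_is_valid(secret, body, header))
print(signature_is_valid(secret, body + b"tampered", header))
```

An rssCloud client needs nothing like this, since a forged notification can only trick it into re-fetching a feed it already trusts.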

So, time passes, and then a feed that you're subscribed to updates! Like I said, I'm not covering the publish side of the protocol, so let's assume that the hub finds out about the update via voodoo magic. In the rssCloud world, the hub goes and sends the URL of the feed that updated to every subscriber, via whatever method they requested. If a feed is known by multiple URLs, the hub deals with that, and sends each subscriber whichever URL they asked for. Subscribers can then update the feeds immediately. In PuSH, on the other hand, each feed has a canonical URL (specified by the <link rel="self"> tag, which is mandatory in PuSH), and the hub sends actual feed content in the notifications, not just feed URLs.
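The two kinds of notification lead to quite different handler code on the subscriber. Here's a minimal sketch, assuming the rssCloud ping arrives as a form-encoded body with a url parameter and the PuSH delivery arrives as an Atom document; the function names and sample data are hypothetical, and a real client would wrap these in an HTTP server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical PuSH delivery: real feed content, not just a URL.
SAMPLE_DELIVERY = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://example.com/atom.xml" />
  <entry><title>Hello, real-time web</title></entry>
</feed>"""


def handle_rsscloud_ping(body):
    """Return the URL of the feed that changed; the caller still has to re-fetch it."""
    urls = parse_qs(body).get("url")
    return urls[0] if urls else None


def handle_push_delivery(body):
    """Return (canonical feed URL, entry titles) read straight from the delivery."""
    root = ET.fromstring(body)
    topic = None
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "self":
            topic = link.get("href")
    titles = [entry.findtext(ATOM + "title", default="")
              for entry in root.findall(ATOM + "entry")]
    return topic, titles


print(handle_rsscloud_ping("url=http%3A%2F%2Fexample.com%2Frss.xml"))
print(handle_push_delivery(SAMPLE_DELIVERY))
```

The rssCloud handler ends with a URL and more work to do, which is where the flash crowd comes from; the PuSH handler already has the new entries in hand.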

In the end, the differences between the two specs come down to allocating complexity. PuSH is a bit more complex, but it's in ways that will help it scale better in the long run.

Implementation

I did my implementation in Python, because everything I write for fun these days is in Python. My first shot at it was Tuesday night, and it started working in the wee hours of the morning, but the design was freaking terrible. >_> They say that you should throw away your first implementation of anything you write, and that certainly applied here. Wednesday afternoon I took another shot at it, got the design down, tweaked the design so that there'd be as little duplication of effort as possible when I implement PuSH later on, and got it working by Wednesday night.

My implementation uses XML-RPC, because gosh, it's just so easy to use in Python. I ended up having to add support for plain HTTP POST requests as well, because it seems like all the feeds available online use those exclusively, go figure. I probably could have gotten better results by using the excellent Twisted library, but I decided I'd rather not add any dependencies to my reader, so I just used the standard Python libraries instead.
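For a sense of why XML-RPC is so easy in Python, here's a small sketch that builds an rssCloud subscription call with nothing but the standard library (shown with the modern Python 3 module name, xmlrpc.client). The parameter order is my reading of the rssCloud XML-RPC interface, and the procedure names, port, and path are hypothetical; the method to call would come from the registerProcedure attribute of the feed's <cloud> tag.

```python
import xmlrpc.client


def build_please_notify(register_procedure, notify_procedure, port, path, feed_urls):
    """Serialize an rssCloud subscription request as an XML-RPC method call.

    Assumed parameter order: the procedure the hub should call back, the
    client's port and path, the protocol name, then the list of feed URLs.
    """
    params = (notify_procedure, port, path, "xml-rpc", list(feed_urls))
    return xmlrpc.client.dumps(params, methodname=register_procedure)


request_xml = build_please_notify(
    "rssCloud.pleaseNotify",      # hypothetical registerProcedure value
    "myReader.feedUpdated",       # hypothetical callback procedure name
    8080,
    "/RPC2",
    ["http://example.com/rss.xml"],
)
print(request_xml)
```

To actually send the call rather than just serialize it, xmlrpc.client.ServerProxy can be pointed at the hub's URL and the same method invoked on it.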

Future directions

This is neat and all, but what are the consequences of this technology going to be? Dave Winer seems dead set on creating a completely decentralized Twitter clone using RSS, with real-time notifications using <cloud>. It'll be interesting if he manages it, but I really think that cloning Twitter is setting the bar too low. :p It's a natural goal, though - Twitter is more fast-paced than blogs, so adding real-time notifications to blogs could very well push in the direction of shorter, more frequent posts.

Connectivity is going to be a problem with this, going forward, since every client needs to be able to accept incoming connections, which hasn't been generally possible since NAT broke the Internet. There are a few possible solutions. The best, obviously, is to just roll out IPv6, so that every client on the Internet is reachable again. Another possibility would be to have RSS "proxies" that are reachable, and that communicate updates to clients by some other means. This could be clients connecting to the proxy and just getting update notifications as bare URLs in a stream, or alternately it could just throw up a static RSS feed that clients can poll more frequently. Both approaches have their pros and cons, but I personally favor the former.

In order to keep up with the accelerating pace of feeds, I predict that the more advanced RSS clients are going to include interfaces to popular blogging platforms, so that you can get your reactions to the news on the web more quickly (analogous to retweeting on Twitter). For this to really work nicely, we're going to need some way of tracking discussions that happen through blog posts. Automatically linking back to the post you're replying to is a trivial first step, and one that's already pretty well established by convention. Backlinks allow you to see which posts are replies to a given post, but they're not consistently used these days due to the potential for spam.

Real-time news is going to be fun. CNN is already on it; they're using WordPress's rssCloud support to run a real-time enabled news feed. That means news brought to your desktop within seconds of being reported. This is the sort of thing the Internet is capable of giving us; it's certainly taken us a while to catch on.
