Tuesday, July 27, 2010

Tahoe project ideas

Tahoe-LAFS is a pretty cool project (which I have blogged about before), but it has some weaknesses. Coincidentally, I've been looking around for a project, so I think I'll try helping out with it.

The current state of Tahoe

Don't get me wrong - as far as cloud storage goes, Tahoe might be the only one I've seen that has some potential. (Maybe I'm setting the bar too high, but maybe everybody else is setting it too low!) If I had to sum up their architecture, though, it looks something like this:

1) A filesystem access layer, which isn't terribly interesting as long as it works
2) A completely badass data management layer, which handles all the tricky bits surrounding security, privacy, and reliability
3) A half-baked distributed storage layer

It's that last layer that I'm interested in, for a lot of reasons. First, there's been a lot of good work in that area over the past 20 years (particularly, a lot of really neat stuff with distributed hash tables 8-10 years ago). Second, it's an area that I've had an interest in for a while, and I think it'd be fun to work on. Third, I think that a more scalable backend (at a minimum, one that handles millions of nodes) is what's needed to take Tahoe from being a neat and useful program to being a capital-D Disruptive one. A secure, reliable, fully distributed filesystem running at Internet scale? Oh, hell yes. :D

So with that in mind, here's what I'm thinking as far as possible avenues of work.

Project ideas

  • Tahoe bug #999
  • Multiple storage backends are kind of orthogonal to what I wish Tahoe would do in the long term, but it'd still be really neat.

  • DHT-based storage
  • It would be really cool if Tahoe stored data in a big DHT; several implementations of DHTs exist, but I think that actually integrating one into Tahoe would be a pretty invasive job. File this one under "overambitious" for now.

  • DHT-based storage shim
  • Less ambitious: set up a DHT storage network, and run Tahoe-compatible storage servers on a few of its nodes so that existing clients can use DHT-based storage transparently. This would get us to Storage Nirvana much more quickly, and mainline Tahoe could transition to using a DHT at some later time.

  • Storage availability zones
  • Tahoe falls into the same trap as pretty much every other distributed storage system, in that it assumes that failures are independent and more or less randomly distributed. In fact, this is almost never the case, thanks to networking: if I'm using Tahoe, then when my flaky home router dies, I will lose access to all my files until I can get reconnected. The goal here is to subdivide the available storage servers into availability zones, and distribute shares in a zone-aware way.

    Right now Tahoe lets you specify a replication level as a ratio A/B, and uses an erasure code to generate B shares of data such that the original data can be rebuilt from any A of them. What I'm proposing for availability zones is specifying a separate replication level for each zone (the default, global zone should have A/B, the zone for my home network should have C/D, the zone for some paid cloud storage provider should have E/F, and so on), generating shares using an (A+C+E+...)/(B+D+F+...) erasure code, and distributing those shares to the matching zones.

  • Amazon S3-compatible interface
  • Something is nagging at the back of my mind, saying that this already exists. Wait, no! I think that was Ceph (another cool distributed filesystem, but with totally different goals). S3 seems to be becoming the de facto standard for cloud storage migration, so it'd probably be useful.

    Actually, taken to the extreme, there's no reason there shouldn't be frontends compatible with all storage providers (OpenStack is another interesting one). What's more, if this were implemented in addition to #999, we could solve cloud storage portability and privacy in one fell swoop.

  • Unbreakable static web hosting
  • You could upload a static website to a geographically-diverse Tahoe cluster, set up multiple frontends (with caching reverse proxies for performance), and put multiple DNS round-robins over those. The result would be reasonably fast static web hosting that's completely resilient to hardware failures, natural disasters, and (depending on the geographic distribution) legal threats.
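
To make the availability zones idea a bit more concrete, here's a rough Python sketch of the bookkeeping it would need. None of this is Tahoe code: the ZonePolicy class, the function names, the zone names, and the numbers are all made up for illustration, and I'm assuming shares are just numbered 0..n-1 and handed out to zones in order. It only computes the combined erasure code parameters and decides which share numbers go to which zone; it doesn't do any actual encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZonePolicy:
    """Hypothetical per-zone replication level, written needed/total."""
    name: str
    needed: int  # the "A" in A/B for this zone
    total: int   # the "B" in A/B for this zone


def combined_code(policies):
    """Return (k, n) for the single erasure code covering every zone.

    k = A+C+E+... and n = B+D+F+..., as in the proposal above.
    """
    k = sum(p.needed for p in policies)
    n = sum(p.total for p in policies)
    return k, n


def assign_shares(policies):
    """Map each zone name to the share numbers it should store."""
    assignment = {}
    next_share = 0
    for p in policies:
        assignment[p.name] = list(range(next_share, next_share + p.total))
        next_share += p.total
    return assignment


def recoverable(policies, reachable_zones):
    """True if the reachable zones together hold at least k shares."""
    k, _ = combined_code(policies)
    available = sum(p.total for p in policies if p.name in reachable_zones)
    return available >= k


# Made-up example: a global zone, my home network, and a paid provider.
policies = [
    ZonePolicy("global", 3, 10),
    ZonePolicy("home", 1, 2),
    ZonePolicy("paid-cloud", 2, 4),
]

k, n = combined_code(policies)
shares = assign_shares(policies)
print(k, n)
print(shares["home"])
print(recoverable(policies, {"home", "paid-cloud"}))
print(recoverable(policies, {"home"}))
```

With these numbers the combined code is 6-of-16. One thing the sketch makes obvious is that a zone can only rebuild a file by itself if it holds at least k shares, so the per-zone totals would have to be picked with the combined k in mind.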
