Tuesday, July 27, 2010

Tahoe project ideas

Tahoe-LAFS is a pretty cool project (which I have blogged about before), but it has some weaknesses. Coincidentally, I've been looking around for a project, so I think I'll try helping out with it.

The current state of Tahoe

Don't get me wrong - as far as cloud storage goes, Tahoe might be the only one I've seen that has some potential. (Maybe I'm setting the bar too high, but maybe everybody else is setting it too low!) If I had to sum up their architecture, though, it looks something like this:

1) A filesystem access layer, which isn't terribly interesting as long as it works
2) A completely badass data management layer, which handles all the tricky bits surrounding security, privacy, and reliability
3) A half-baked distributed storage layer

It's that last layer that I'm interested in, for a lot of reasons. First, there's been a lot of good work in that area over the past 20 years (particularly, a lot of really neat stuff with distributed hash tables 8-10 years ago). Second, it's an area that I've had an interest in for a while, and I think it'd be fun to work on. Third, I think that a more scalable backend (at a minimum, one that handles millions of nodes) is what's needed to take Tahoe from being a neat and useful program to being a capital-D Disruptive one. A secure, reliable, fully distributed filesystem running at Internet scale? Oh, hell yes. :D

So with that in mind, here's what I'm thinking as far as possible avenues of work.

Project ideas

  • Tahoe bug #999
  • This is the ticket for supporting multiple storage backends. That's kind of orthogonal to what I wish Tahoe would do in the long term, but it'd still be really neat.

  • DHT-based storage
  • It would be really cool if Tahoe stored data in a big DHT; several implementations of DHTs exist, but I think that actually integrating one into Tahoe would be a pretty invasive job. File this one under "overambitious" for now.

  • DHT-based storage shim
  • Less ambitious: set up a DHT storage network, and run Tahoe-compatible storage servers on a few of its nodes so that existing clients can use DHT-based storage transparently. This would get us to Storage Nirvana much more quickly, and mainline Tahoe could transition to using a DHT at some later time.

  • Storage availability zones
  • Tahoe falls into the same trap as pretty much every other distributed storage system, in that it assumes that failures are somewhat randomly distributed. In fact, this is almost never the case, thanks to networking: if I'm using Tahoe, then when my flaky home router dies, I will lose access to all my files until I can get reconnected. The goal for this idea is being able to subdivide the available storage servers into availability zones, and distribute shares in a zone-aware way.

    Right now Tahoe lets you specify a replication level as a ratio A/B, and uses an erasure code to generate B shares of data such that the original data can be rebuilt from any A of them. What I'm proposing for availability zones is specifying a separate replication level for each zone (the default, global zone should have A/B, the zone for my home network should have C/D, the zone for some paid cloud storage provider should have E/F, etc.), generating shares using an (A+C+E+...)/(B+D+F+...) erasure code, and distributing each zone's portion of the shares to the servers in that zone.

  • Amazon S3-compatible interface
  • Something is nagging at the back of my mind, saying that this already exists. Wait, no! I think that was Ceph (another cool distributed filesystem, but with totally different goals). S3 seems to be becoming the de facto standard for cloud storage migration, so it'd probably be useful.

    Actually, taken to the extreme, there's no reason there shouldn't be frontends compatible with all storage providers (OpenStack is another interesting one). What's more, if this were implemented in addition to #999, we could solve cloud storage portability and privacy in one fell swoop.

  • Unbreakable static web hosting
  • You could upload a static website to a geographically diverse Tahoe cluster, set up multiple frontends (with caching reverse proxies for performance), and put multiple DNS round-robins over those. The result would be reasonably fast static web hosting that's completely resilient to hardware failures, natural disasters, and (depending on the geographic distribution) legal threats.
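
Going back to the availability zones idea for a second: here's a rough Python sketch of the bookkeeping I have in mind. None of this is Tahoe code. ZonePolicy, combined_code, assign_shares, and the example numbers are all names and values I made up for illustration. It only works out the combined erasure code parameters and which share numbers would go to which zone; the actual encoding and upload are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZonePolicy:
    """Hypothetical per-zone replication level, written as needed/total."""
    name: str
    needed: int  # the "A" in A/B: shares this zone contributes to the threshold
    total: int   # the "B" in A/B: shares stored in this zone


def combined_code(policies):
    """Return (k, n) for the single erasure code covering every zone.

    k is the sum of the per-zone "needed" values (A+C+E+...),
    n is the sum of the per-zone "total" values (B+D+F+...).
    """
    k = sum(p.needed for p in policies)
    n = sum(p.total for p in policies)
    return k, n


def assign_shares(policies):
    """Hand out consecutive share numbers 0..n-1 to zones, in order."""
    assignment = {}
    next_share = 0
    for p in policies:
        assignment[p.name] = list(range(next_share, next_share + p.total))
        next_share += p.total
    return assignment


# Made-up example: global zone 3/10, home network 1/2, paid cloud 2/4.
policies = [
    ZonePolicy("global", 3, 10),
    ZonePolicy("home", 1, 2),
    ZonePolicy("paid-cloud", 2, 4),
]

k, n = combined_code(policies)
shares_by_zone = assign_shares(policies)
print(f"erasure code: any {k} of {n} shares rebuild the file")
for zone, shares in shares_by_zone.items():
    print(f"{zone}: shares {shares}")
```

One thing this makes obvious: with a single combined code, a reader needs k shares in total, from whatever mix of zones happens to be reachable.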

Saturday, July 17, 2010

Favorite thing about working at Microsoft

Well, I can't really pick just one thing, so I'm just going to list a lot of awesome things.

  • Free beer at company events
  • If I seriously wanted to, I could get drunk at work every other week on Microsoft's dime. Nobody actually does, of course, because that'd be idiotic, but there's certainly enough beer that you could.

  • Working with extremely geeky people
  • Today: Me: "Yeah, waiting doesn't really count as work." Coworker: "Unless you're a spinlock! :D" If you don't get it, it would take way too long to explain. XD

  • Completely awesome company events
  • The big intern event for the summer was taking us all to go see a private showing of Cirque du Soleil. (Oh, and while we were there? They gave us all free Zune HDs.) Microsoft is trying pretty hard to make a good impression on us while we're interns, and you know what? It's working.

  • Challenging work environment
  • This is the first job I've ever had where I feel like I need to work extra hours just to keep up. It's a humbling experience, and that's something I needed. I feel like I'd reached a plateau with my skills, and this internship is really helping me push up to that next level.

  • Real-world impact
  • Internships here are structured the same way as the projects that full-time employees do, only shorter. So there's a good chance that some of the code I'm writing this summer will ship in the next version of Office. That's a pretty cool feeling. :D

  • Fully furnished apartment
  • Yeah, I could have gotten by without this, but it sure is convenient to just show up the weekend before you start work and actually have a place to sleep. Related to this: I'm living about a ten-minute bike ride away from my office.

  • Novel work environment
  • This is the first time I've really done Windows-style development, and since I left my Linux box at home, it's a complete immersion sort of thing. (I know it sounds geeky, but this is the coder equivalent of going to live in a foreign country for a while.) There are a lot of things I'm learning about that I never would have gotten a chance to learn about otherwise (.NET, in particular, is freaking cool).

  • Intern perks
  • There are a lot of these, but the one that springs to mind is a free year of MSDN access. If I wanted to go and actually buy this, it'd cost something comparable to what they're paying me for the entire summer. o_o

  • Flexible work hours
  • Instead of 9-5, I work 9+X to 5+X, where X is a function of when I wake up in the morning. It is incredibly convenient. :D

  • The weather
  • I seriously cannot remember any time before coming to Washington when I could go outside during the summer and find that it's a completely pleasant temperature. Surprisingly, there aren't as many rainy days as people say to expect - I guess that's more a winter thing.

Saturday, July 10, 2010

oh no

I have forgotten how to blog D: