Sunday, June 6, 2010

Everything you need to know about Tahoe-LAFS

There are not a lot of pieces of software that I would describe as revolutionary, but Tahoe-LAFS is one of them. Unfortunately, as is all too common with open source software, the potential of the software far outstrips the documentation, with the predictable result that nobody understands what Tahoe is capable of. This post is an attempt to explain the important bits of Tahoe as succinctly as possible.

High-level goal

Tahoe is an almost-completely-decentralized, secure file store. You can think of it as a reliable black box which you can upload files and directories to. The data is highly resilient (up to 70% of the distributed data can be lost and the file will still be recoverable, and that's just with the default settings) and secure (the keys needed to decrypt the data are built right into the URLs you use to access it, so nobody without the right URL can read anything). By giving somebody a secret URL, you can give them read-only or read-write access to any file or directory. Tahoe guarantees that, if somebody does not have this secret URL, they cannot access the file data in any way.

The advantage of Tahoe over something like Amazon S3 is that you don't have to trust your cloud storage provider. S3 provides very similar capabilities, but it's a given that Amazon employees are able to read your data, which makes it unsuitable for some applications, and makes people who care about their privacy nervous.

When you upload a file to Tahoe, it's encrypted, passed through an erasure code, and then distributed across multiple computers participating in a cluster, none of which can read the actual data. When you access data in Tahoe, a form of authentication is built into the URL you use, so you can keep data private simply by keeping that URL secret.
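
To make this concrete, here's roughly what storing and fetching a file looks like through a local node's web gateway. This is just a sketch - I'm assuming the gateway is on its default address of http://127.0.0.1:3456, and that PUT /uri and GET /uri/<cap> behave the way the web-API docs describe them.

    import urllib.parse
    import urllib.request

    BASE = "http://127.0.0.1:3456"  # assumed: a local node's web gateway

    # Upload: PUT the file body to /uri. The node encrypts it, erasure-codes
    # it, spreads the shares across the cluster, and returns a capability
    # string for the new file as the response body.
    req = urllib.request.Request(BASE + "/uri", data=b"hello tahoe", method="PUT")
    cap = urllib.request.urlopen(req).read().decode("ascii").strip()
    print("file cap:", cap)

    # Download: anyone who knows the cap (and only them) can get the plaintext.
    url = BASE + "/uri/" + urllib.parse.quote(cap)
    assert urllib.request.urlopen(url).read() == b"hello tahoe"

The tahoe command-line tool (tahoe put, tahoe get) does essentially the same thing - it just talks to this gateway for you.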

Erasure coding

Wikipedia explains the theory far more faithfully than I would be able to, so I'll just try to give a more concrete example of how it works. Let's say you've got a 300 kB file. A 3-of-10 erasure code (what Tahoe defaults to) would take that file, and give you ten 100 kB chunks of data, or 1 MB total. You'd then spread this data around as widely as possible, and the erasure code guarantees that given any three chunks out of the ten, you can reconstruct the original data. This means that if you gave data chunks to ten different computers, and seven of them crashed, you'd still be able to access your data just fine.
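
To show the "any k out of n" idea in the smallest possible form, here's a toy 2-of-3 code built out of nothing but XOR parity. This is purely illustrative - Tahoe itself uses proper Reed-Solomon coding (via the zfec library) to get arbitrary parameters like 3-of-10 - but the recovery property is the same.

    # Toy 2-of-3 erasure code: split the data into two halves and add one
    # XOR-parity share. Any two of the three shares recover the original.
    def encode(data):
        half = (len(data) + 1) // 2
        a = data[:half]
        b = data[half:].ljust(half, b"\0")  # pad the second half if needed
        parity = bytes(x ^ y for x, y in zip(a, b))
        return [(0, a), (1, b), (2, parity)], len(data)

    def decode(shares, length):
        got = dict(shares)
        if 0 in got and 1 in got:
            a, b = got[0], got[1]
        elif 0 in got:                      # rebuild b from a and the parity
            a = got[0]
            b = bytes(x ^ y for x, y in zip(a, got[2]))
        else:                               # rebuild a from b and the parity
            b = got[1]
            a = bytes(x ^ y for x, y in zip(b, got[2]))
        return (a + b)[:length]

    original = b"three hundred kilobytes, in spirit"
    shares, length = encode(original)
    # Drop any one share and the data still comes back intact:
    assert decode(shares[:2], length) == original
    assert decode([shares[0], shares[2]], length) == original
    assert decode(shares[1:], length) == original

The redundancy isn't free: you pay for it with the 10/3 expansion factor. If you want a different trade-off, the k and n parameters are configurable per node (shares.needed and shares.total in tahoe.cfg, if I remember the option names right).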

Erasure coding by itself isn't secure, so Tahoe encrypts the data before the erasure coding step. This makes it possible to give a certain limited capability (a "verify cap", explained later) to an untrusted machine with a lot of bandwidth, which can then download some of the encrypted data, rerun the erasure code to fill in any missing data, and re-upload the complete coded file - all without being able to read the actual file data.
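
If I'm reading the web-API docs right, the way you'd actually use this is to hand the untrusted machine your verify cap and have it POST a check-and-repair request to its own gateway. The t=check&repair=true query (and the output=JSON flag) below are my reading of those docs, so treat the details as assumptions rather than gospel.

    import urllib.parse
    import urllib.request

    BASE = "http://127.0.0.1:3456"       # assumed: local web gateway
    verify_cap = "URI:CHK-Verifier:..."  # placeholder, not a real cap

    # Ask the node to check this file's shares and repair any that are
    # missing. Holding only the verify cap, it can fetch shares, re-run the
    # erasure code, and upload replacements, but it can't decrypt anything.
    url = (BASE + "/uri/" + urllib.parse.quote(verify_cap)
           + "?t=check&repair=true&output=JSON")
    req = urllib.request.Request(url, data=b"", method="POST")
    print(urllib.request.urlopen(req).read().decode("utf-8"))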

Introducers

The introducer is the one part of a Tahoe cluster that isn't decentralized. A network is built around an introducer, which does what it sounds like: it introduces the Tahoe nodes in a cluster to one another. If you're familiar with BitTorrent, the introducer plays much the same role as a BitTorrent tracker.

There's been talk of replacing introducers with a DHT (similar to what BitTorrent does these days), but as far as I know, there's been no action on this.

Capabilities

A "capability" for a piece of data is just a very long URL that you can use to access that data in a certain way. It's called a capability because it determines what you can do with the data - a "write capability", for instance, contains enough information for you to overwrite the file with a new version, while a "read capability" will only let you read the file, but not modify it. There's also something called a "verify capability" (mentioned earlier), which gives you access to the encrypted contents of the file, and can really only be used for verifying the data. Encryption keys for the data are part of the capability itself (which is part of the reason that they're very long), so there's no need to store those separately.

Data security in Tahoe is accomplished by keeping these capabilities - "caps" - secret. If you're the only one who has the write cap to a file, Tahoe guarantees that nobody else can modify the file, and if you're the only one who has a read cap, Tahoe guarantees that nobody else can view the file. Capabilities are the only method of access control. This isn't how we're used to thinking about data security - where are the passwords? - but once you wrap your head around it, it makes sense.

Capabilities can be demoted - given a write cap, you can generate the read cap automatically, and given a read cap, you can generate the verify cap. Thus, holding a capability at a certain level for a file also means you can grant other people the same or a lower level of access. I mentioned earlier that untrusted servers can refresh your data without reading it - this is what verify caps are for. You can make the verify cap for your data public, and anybody will be able to make sure it stays in the cloud, but nobody will be able to read it.
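
You don't have to derive the weaker caps by hand - the node will demote them for you. Here's a sketch of pulling the read cap and verify cap out of a write cap via the web gateway, assuming the default port and assuming the ?t=json metadata really does include ro_uri and verify_uri fields the way I understand the docs.

    import json
    import urllib.parse
    import urllib.request

    BASE = "http://127.0.0.1:3456"  # assumed: local web gateway
    write_cap = "URI:SSK:..."       # placeholder for a real write cap

    # Asking for JSON metadata about a cap returns, among other things,
    # the weaker caps that can be derived from it.
    url = BASE + "/uri/" + urllib.parse.quote(write_cap) + "?t=json"
    node_type, info = json.load(urllib.request.urlopen(url))

    print("read cap:  ", info["ro_uri"])      # hand this to readers
    print("verify cap:", info["verify_uri"])  # safe to publish to anybody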

Directories are treated the same way as files, so you can also have read/write/verify caps for directories. I'm not entirely sure how this part works, but the read-only-ness of a directory read capability carries through - if you give somebody a read cap for a directory, all the files they can see in the directory will be read-only for them as well.
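
Here's a rough sketch of what that looks like through the web gateway: make a directory, link a file into it, then list the directory through its demoted read cap. The endpoint names (t=mkdir, t=json, PUT-to-a-directory-path) are my understanding of the web API, so take them with a grain of salt.

    import json
    import urllib.parse
    import urllib.request

    BASE = "http://127.0.0.1:3456"  # assumed: local web gateway

    # Create a new mutable directory; the response body is its write cap.
    req = urllib.request.Request(BASE + "/uri?t=mkdir", data=b"", method="POST")
    dir_rw = urllib.request.urlopen(req).read().decode("ascii").strip()

    # Upload a file and link it into the directory in one step.
    req = urllib.request.Request(
        BASE + "/uri/" + urllib.parse.quote(dir_rw) + "/notes.txt",
        data=b"secret notes", method="PUT")
    urllib.request.urlopen(req)

    # The directory's metadata includes its demoted read-only cap. Listing
    # the directory through that cap should only ever hand out read caps
    # for the children - that's the "read-only-ness carries through" part.
    _, info = json.load(urllib.request.urlopen(
        BASE + "/uri/" + urllib.parse.quote(dir_rw) + "?t=json"))
    dir_ro = info["ro_uri"]

    _, listing = json.load(urllib.request.urlopen(
        BASE + "/uri/" + urllib.parse.quote(dir_ro) + "?t=json"))
    for name, (kind, child) in listing["children"].items():
        print(name, kind, child.get("ro_uri"))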

Final thoughts

So far, I've been unable to find any information about how Tahoe would scale in very large configurations (thousands to millions of storage nodes), but it doesn't seem like it would. The documentation that I've found specifies that at startup, a node connects to every running storage node that the introducer gives it, which seems like a huge scalability bottleneck. I kind of want to ask the developers if this is really as bad as it sounds, but I haven't had a chance yet.

Assuming it can (or will eventually be able to) scale, I really hope Tahoe becomes more widely used. If it were, it would be a whole new way to think about data management - like a crowdsourced, secure version of Amazon S3. As far as I know, though, there aren't any large Tahoe clusters running - their public test cluster only has two or three nodes in it, as far as I can see. (If you want to start one with me, leave a comment - I can provide an introducer and some storage to start with, but there's no point if it'll be just me. XD)