Friday, June 12, 2009

Random idea post

Here is a thing that I think would be neat. This is just a braindump, and not a well-thought-out post, so read it as such. I am rambling but I am going somewhere with it.

Filesystem images are really useful for some stuff, but there are a lot of things that suck about them. First, it should be possible to use them as a regular user, but mounting an image through a loopback device requires root access. Sure, you could allow suid access to mount, but then users could mount anything anywhere, which is a massive security hole. I'm talking about being able to mount an image in a controlled way, visible only to the current user or process (or at some other level of control), so that it's usable in the same way but still secure.

This could happen either in kernel space or in userspace. Doing it in kernel space, and integrating with the current UnionFS code, would probably be a better solution overall. However, doing stuff in the kernel always has some complications, since upgrading the kernel is a pain for a lot of users (that reminds me, 2.6.30 is out!), and it doesn't really help people on other operating systems. The other way to do it would be to write wrappers around the existing filesystem API, and intercept the necessary calls to pretend that there's a filesystem mounted, all in userspace. This is entirely possible, but so much code depends explicitly on the existing filesystem API that I don't know how feasible it'd be for any nontrivial projects.
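
To make the userspace option a bit more concrete, here's a toy sketch of the "wrapper that pretends something is mounted" idea. Everything in it is made up for illustration: `VirtualMounts` is a hypothetical class, the "image" is just a dict of path to bytes, and it only handles read-only binary opens on POSIX-style paths (with a mount point other than `/`). A real version would have to intercept far more than `open`.

```python
import builtins
import io
import posixpath


class VirtualMounts:
    """Hypothetical userspace mount table: paths under a mount point are
    served from an in-memory image instead of the real filesystem."""

    def __init__(self):
        self.mounts = {}  # mount point -> {relative path: file contents}

    def mount(self, mount_point, files):
        self.mounts[posixpath.normpath(mount_point)] = dict(files)

    def open(self, path, mode="rb"):
        path = posixpath.normpath(path)
        for mount_point, files in self.mounts.items():
            if path.startswith(mount_point + "/"):
                if mode != "rb":
                    raise PermissionError("this sketch is read-only")
                rel = path[len(mount_point) + 1:]
                if rel not in files:
                    raise FileNotFoundError(path)
                return io.BytesIO(files[rel])
        # Not under any virtual mount: fall through to the real filesystem.
        return builtins.open(path, mode)
```

The mount is only visible to code that goes through this object, which is the "controlled visibility" part. It also shows the catch: anything that calls the real `open` directly never sees it.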

Another thing that'd be cool is dynamic sizing. What I'm basically imagining here is a tar (non-UNIX people: pretend I just said zip) file, but with the emphasis on being an efficient writable image. Log-structured filesystems are a good starting point here, since you only need to be able to append (easy) and make fixed-size updates to previous locations in the file (also really easy). The problem you run into - one log-structured filesystems don't offer a cheap solution for - is fragmentation. At some point you have to repack the archive to free up space that's used by deleted files, and if we want to optimize for performance, then that just sucks. Note to self: look at how databases handle this, they must run into similar problems.
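
Here's a minimal sketch of what I mean by "append plus fixed-size updates", and of where the fragmentation comes from. The record layout (a 7-byte header with a live flag, name length, and data length) and the `LogArchive` class are invented for this example, not any real format, and the index lives only in memory.

```python
import io
import struct

# Assumed record header: live flag (1 byte), name length (2), data length (4).
HEADER = struct.Struct("<BHI")


class LogArchive:
    def __init__(self, f):
        self.f = f            # any seekable binary file object
        self.index = {}       # name -> (record offset, name length, data length)
        self.dead_bytes = 0   # space held by deleted or overwritten records

    def write(self, name, data):
        if name in self.index:
            self.delete(name)             # the old version becomes garbage
        key = name.encode("utf-8")
        self.f.seek(0, io.SEEK_END)       # new records are only ever appended
        offset = self.f.tell()
        self.f.write(HEADER.pack(1, len(key), len(data)) + key + data)
        self.index[name] = (offset, len(key), len(data))

    def read(self, name):
        offset, key_len, data_len = self.index[name]
        self.f.seek(offset + HEADER.size + key_len)
        return self.f.read(data_len)

    def delete(self, name):
        offset, key_len, data_len = self.index.pop(name)
        self.f.seek(offset)
        self.f.write(b"\x00")             # fixed-size in-place update: clear live flag
        self.dead_bytes += HEADER.size + key_len + data_len

    def repack(self):
        """Copy the live records into a fresh archive; this is the slow part."""
        fresh = LogArchive(io.BytesIO())
        for name in list(self.index):
            fresh.write(name, self.read(name))
        return fresh
```

Writes and deletes are cheap, but `dead_bytes` only ever grows until you call `repack()`, which has to copy every live byte. That's the part that sucks.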

Yet another thing that'd be cool is cheap copy-on-write clones. The more obvious way to do this is in the archive itself - make it possible to have multiple concurrent clones in a single filesystem image, and select one when you mount it. More interestingly, you could also do this by falling back on the underlying filesystem. BtrFS has an alternate cp command which allows you to create a copy-on-write "copy" of a file. If you used that to clone the archive, and were careful to keep it mostly in filesystem-sized blocks, you could have an efficient solution that lets you treat clones as separate files, keeps the archives simple, and is generally pretty neat.
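
A rough sketch of the "clones inside the archive" variant, at the block level. The `CowClone` class and the 4096-byte block size are assumptions for illustration; I've also assumed that an image becomes read-only as soon as it has been cloned, which keeps the sharing safe without any reference counting.

```python
BLOCK_SIZE = 4096  # assumed block size for the sketch


class CowClone:
    """A clone stores only the blocks written to it; everything else is
    read through to its parent, so cloning copies no data."""

    def __init__(self, parent=None):
        self.parent = parent
        self.blocks = {}      # block number -> bytes written in this clone
        self.frozen = False   # set once something has been cloned from us

    def clone(self):
        self.frozen = True    # the shared base must not change under its clones
        return CowClone(parent=self)

    def write_block(self, n, data):
        if self.frozen:
            raise PermissionError("image has clones and is now read-only")
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes must be exactly one block")
        self.blocks[n] = bytes(data)

    def read_block(self, n):
        node = self
        while node is not None:   # walk up until some layer has the block
            if n in node.blocks:
                return node.blocks[n]
            node = node.parent
        return bytes(BLOCK_SIZE)  # never written anywhere: reads as zeros
```

Creating a clone is constant time, and each clone costs only as much space as the blocks it actually changes. The price is that a read may have to walk a chain of parents.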

Finally, filesystem-level transactions. Seriously. Why does every filesystem on earth not already have this? I should be able to specify that an arbitrary series of operations on a set of files should be treated as atomic. The current best way to simulate this is to create a new directory containing a copy of all the files for the transaction, do the updates there, and then (atomically) rename the directories. I mean, seriously, what the hell?
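
For the record, the workaround looks roughly like this. One swapped-in detail: a plain rename can't replace a non-empty directory, so this sketch assumes the "live" path is a symlink to the current version directory and swaps the symlink instead. `transactional_update` is a hypothetical helper, it assumes a POSIX system and Python 3.8+, and it does nothing about durability (no fsync) or about readers that already have files open.

```python
import os
import shutil
import tempfile


def transactional_update(live_link, update):
    """Apply update(staged_dir) to a private copy, then publish it by
    atomically replacing the live_link symlink."""
    parent = os.path.dirname(os.path.abspath(live_link))
    current = os.path.realpath(live_link)
    staged = tempfile.mkdtemp(prefix="txn-", dir=parent)
    tmp_link = staged + ".link"
    try:
        # Copy every file, whether or not the transaction touches it.
        shutil.copytree(current, staged, dirs_exist_ok=True)
        update(staged)
        os.symlink(staged, tmp_link)
        os.replace(tmp_link, live_link)   # the single atomic step
    except BaseException:
        # Roll back: the live link still points at the old version.
        if os.path.lexists(tmp_link):
            os.remove(tmp_link)
        shutil.rmtree(staged, ignore_errors=True)
        raise
    shutil.rmtree(current)                # old version is no longer referenced
    return staged
```

Note how much work this is for what should be one primitive: a full copy of every file, just to get an all-or-nothing update.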

I'm not just saying this because it'd be cool (even though it would be); I have an application in mind. Let's say you start out with a read-only image of your root filesystem. Take the program you want to run, and stick it in a filesystem archive as described above. To run it, just mount it along with all of its dependencies, and go. All the program state can be kept inside the program's archive itself, because we have cheap copy-on-write clones, so doing that is basically free. Now, let's say you want to transfer the (running) program to a different machine. Take it down (the data on disk can be guaranteed consistent because we have transactions), send the program and all its dependencies to the new machine, and restart it. The transfer can be done efficiently: if a dependency is read-only, and the other end has an identical copy, then it can be skipped, and this can be made the common case easily enough. Add in some magic to keep existing file handles working, and we've got cheap process migration, across heterogeneous systems, mostly in userspace.
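
The "skip dependencies the other end already has" step could be as simple as comparing content hashes. This is a sketch under my own assumptions: images are identified by their SHA-256 digest, the receiver can tell us which digests it holds, and `plan_transfer` is a hypothetical name.

```python
import hashlib


def digest(data: bytes) -> str:
    """Identify a read-only image by the hash of its contents."""
    return hashlib.sha256(data).hexdigest()


def plan_transfer(images, remote_digests):
    """Return the names of the images that actually need to be sent.

    images: dict mapping image name -> image bytes on the sending side
    remote_digests: set of digests the receiving machine already holds
    """
    to_send = []
    for name, data in images.items():
        if digest(data) not in remote_digests:
            to_send.append(name)
    return to_send
```

If most dependencies are stock read-only images that both machines already have, only the program's own image (and its copy-on-write state) goes over the wire.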

Cool.
