Thursday, September 24, 2009

Fun with find

The find command is incredibly useful, if a bit arcane, so it's a shame that more Linux users aren't aware of it. Basically, if what you're trying to do sounds like "search for files with these characteristics, and do something to them", then find is probably the tool you're looking for.

Let's start with the very basics. If you run find with no arguments, it'll start listing all the files it can find. This isn't especially useful behavior, but let's take a closer look at it. find takes a list of paths, and looks at all files contained in each of them - if you don't give it any, it'll assume the current directory. We can introduce a bit of bash-fu to start doing something that looks like it might be useful:

$ find ${PATH//:/ }

This will list all the programs that are available on your path. (The weird-looking variable reference just replaces colons with spaces - see the parameter expansion section of the bash man page if you're curious.)
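
For example, here's what the substitution does on a hypothetical system with a short PATH:

$ echo $PATH
/usr/local/bin:/usr/bin:/bin
$ echo ${PATH//:/ }
/usr/local/bin /usr/bin /bin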

Just listing files is no fun, though. You can do that already with ls, which also has the advantage of being fewer letters to type! So let's start getting into the real power of find, with expressions.

After the paths to search on, you can specify any number of expressions - basically, filters that look at the list of files and only select the ones matching some criteria. One simple one is -name:

$ find /usr/portage -name ChangeLog | wc -l

(If the bar thing looks funny to you, you need to read up on pipes. If you don't know pipes, you can't really say you know how to use the command line - they're that important.) This is a command I used just a few hours ago, to find out how many ChangeLog files there are in Gentoo's portage tree. Without the find command, this would have been kind of a pain. find also has an -iname filter, which does a case-insensitive match - useful if you're looking for files that have inconsistent capitalization.
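
For instance, here's how you'd round up photos no matter how your camera capitalized the extension (~/photos is just an example path):

$ find ~/photos -iname "*.jpg"

This matches .jpg, .JPG, .Jpg, and so on.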

There are a lot of other possible filters, too many to list here, so you'll have to read the find man page to see them all. Here are just a few examples:

$ find ${PATH//:/ } -name "mkfs.*"
This is the earlier example, but with a twist - this prints out the full path to programs matching a given pattern. (If you only want one program, the which command is easier, though.)

$ find ~ -empty
This lists all empty (zero-length) files in your home directory.

$ find / -user root
This will list all files owned by root. (You probably have to be root for this to actually list all of them, for obvious reasons.)

$ find / -size +500M
This finds all files on your system larger than 500 megabytes, and requires some explanation. Filters that take numerical arguments can usually also take a + or - modifier, to mean "greater than this" or "less than this". If you leave it out, then you can search for files that have some exact size.
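
A quick sketch of all three forms, using the c suffix to measure in plain bytes:

$ find ~ -size -100c    # smaller than 100 bytes
$ find ~ -size 100c     # exactly 100 bytes
$ find ~ -size +100c    # larger than 100 bytes

(One gotcha: with the bigger units like M, GNU find rounds sizes up to whole units before comparing, which can make the + and - matches surprising.)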

$ find ~ -mmin -30
List all the files in your home directory that were modified in the past 30 minutes. (No more wondering about where you saved that important file!)

$ find /usr/bin -not -executable
There shouldn't be any non-executable files there, but I found one on my system - probably a bug in the package that installed that file. (Want more logical operators? You can stick a -or between two filters and find will return the file if it matches either of them.)
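
For example, here's a sketch that combines -or with another filter - the parentheses group the expression, and have to be escaped so bash leaves them alone:

$ find ~ \( -name "*.tmp" -or -name "*.bak" \) -size +1M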

"But wait," you might be thinking. "You said find would look for files and let me do stuff to them, but listing them isn't terribly interesting!" Don't worry, the fun is just beginning. :D

The simplest way to get find to do stuff with files is not to use find's own machinery at all: pipe the output to xargs instead. For most simple tasks, this is way easier than using find's execution capabilities. The following three commands do basically the same thing:

$ find ~ -size 0 | xargs rm
$ find ~ -size 0 -exec rm "{}" +
$ find ~ -size 0 -delete

The first one just pipes the list of files to xargs, which is a nifty little utility that runs the command it's given on each filename it gets through the pipe. In this case, it runs rm and deletes all the files it's passed, but you could use any command there.
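
One caveat: xargs splits its input on whitespace, so filenames containing spaces will trip it up. GNU find and xargs can use a null byte as the separator instead, which sidesteps the problem entirely:

$ find ~ -size 0 -print0 | xargs -0 rm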

The second one uses find's -exec option, which gives you more control over how the command is constructed. After the -exec comes rm, which is pretty self-explanatory - it's the command you're executing. The "{}" thing is find's weird way of saying "the file that was found" - this is where the filename gets substituted into the command.

The + ends the command, but there's a twist here. If you end the command with a semicolon instead, find runs the command once for each file. (NB: you have to put the semicolon in quotes or bash messes with it. This took me forever to figure out :( ) If you use +, on the other hand, it terminates the command just the same, but it also tells find to jam as many filenames into each invocation as it can, subject to whatever limits the OS imposes. For large file lists, this can be the difference between your command running thousands of times or just a few times, so using + wherever possible is a good habit to get into.
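
To make the difference concrete, here's a hypothetical run in a directory containing just a.txt and b.txt, with echo standing in for a real command:

$ find . -name "*.txt" -exec echo "{}" ";"
./a.txt
./b.txt
$ find . -name "*.txt" -exec echo "{}" +
./a.txt ./b.txt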

The third is mainly for completeness - find has a builtin function for deleting files, making this example a bit pointless. :)

Here are a few more practical examples.

$ find /usr/portage -name ChangeLog -exec du -c "{}" + | grep total
I used this to find the total disk space on my system taken up by ChangeLog files in the portage tree. "du -c" will print out the total disk space used by all the files you give it, and the grep filters the output down to just those totals.

$ find -type d -exec chmod 755 "{}" +
Somehow I had a pile of directories on my NFS share that had no execute permissions for all users, so other users couldn't even enter those directories. This fixed all that in a single command.

$ find ${PATH//:/ } -perm -4111 -user root
Shows you all binaries available on your path that are suid root. These can be serious security risks if the programs are written insecurely.

$ find -mtime +365 -exec mv -t archive/ "{}" +
Moves all files that haven't been modified in more than a year to another directory. (With the + form, the "{}" has to come last, so we use GNU mv's -t option to name the destination directory up front.)

$ find -nouser -exec chown root "{}" + , -nogroup -exec chgrp root "{}" +
Find all files that are owned by a nonexistent user or group, and change that ownership to root. Note the comma in there; it splits up the expression so that you can operate on multiple sets of files in a single find command, and only have to actually scan the directory tree once. If you want to do something like this in the absolute fastest way, find is your friend.

That's about the limit of my knowledge, but the find man page has loads more information, as well as some more examples.

Friday, September 11, 2009

The coming real-time web

It all started in 2001, as far as I can tell, in the early days of RSS. There was an optional part of the RSS spec - the <cloud> tag - which could do some neat stuff. It specified a protocol by which a subscriber to an RSS feed could also get near-instantaneous notifications whenever a feed updated, via a "cloud" server. Unfortunately, this part of RSS never really caught on, and without support from any feeds or software, it was basically dead in the water, and everybody kind of forgot it existed.

Fast forward to early summer of this year...

Sometime at the beginning of this summer, in May or June, Google announced their protocol for real-time notifications, called PubSubHubbub. (PubSub is shorthand for publish/subscribe, I guess. Also, you can acronym it to PuSH, and it uses push notifications. Neat!) PubSubHubbub is a really nice piece of design work - I can almost smell the scalability when I read it. (But more on the design issues later.) A bit later, in mid-July or so, Dave Winer (whose blog I follow) decided that it'd be neat to get the old RSS cloud running again. He starts writing code, and putting up test feeds, and people start showing up to the party.

Fast forward to today...

PubSubHubbub is live on LiveJournal and Blogger (including this blog!), while rssCloud is live on Wordpress. The idea has been around for years, but suddenly, in the past month or two, real-time notifications for blogs have taken off in a big way. It's no fun if no aggregators support them, of course, which is why I implemented rssCloud support in my personal aggregator this week. (Yes, I wrote my own aggregator. All the other ones I tried out sucked. >_>) PubSubHubbub support is coming later, once I work out all the current bugs.

Design Issues

Here's a quick rundown of how RSS <cloud> and PubSubHubbub work. Unless you care deeply about how RSS and Atom feeds work, you can probably skip this section. :) I'm only covering the subscriber side of the protocols here, because that's all I really care about.

Broad overview: Both protocols are detectable in feeds, and both can work with either RSS or Atom feeds. When a subscriber detects a real-time protocol in a feed, it can notify (and periodically re-notify) the specified "hub" server that it wants to receive updates. Later, when an update happens, the hub server sends a notification to each subscriber.

In rssCloud, subscribers can detect the <cloud> tag in an RSS feed. PuSH, on the other hand, uses a standard <link> tag, with attribute rel="hub". They both contain the URL of a hub server that the client can contact. Here's the first major philosophical difference between the two: rssCloud allows several communication protocols (HTTP POST, XML-RPC, SOAP), while PuSH only uses HTTP POST. Standardizing on one protocol takes some flexibility away from implementors, but also simplifies things since they only have to worry about one protocol. Overall, I'd say this is a win for PuSH.
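
For reference, here's roughly what the two look like in a feed - the <cloud> example is adapted from the one in the RSS 2.0 spec, and the hub URL in the second is Google's reference hub:

<cloud domain="rpc.sys.com" port="80" path="/RPC2" registerProcedure="myCloud.rssPleaseNotify" protocol="xml-rpc" />

<link rel="hub" href="http://pubsubhubbub.appspot.com/" />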

Next, the client notifies the hub that it would like to hear about changes, using the specified method. In PuSH, the client needs to specify a URL that it can be reached at. rssCloud, on the other hand, has the server detect the client's address based on where the request came from. This is incredibly convenient for clients behind a NAT, since they have no good way of knowing their own address.
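
As a sketch, a PuSH subscription is just an HTTP POST to the hub with a handful of hub.* parameters (the feed and callback URLs here are made up):

$ curl -d "hub.mode=subscribe" \
       -d "hub.topic=http://example.com/feed.atom" \
       -d "hub.callback=http://example.com/push-callback" \
       -d "hub.verify=sync" \
       http://pubsubhubbub.appspot.com/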

PuSH also allows the client to specify a "secret", which the server can then use to send authenticated updates to clients. This brings us to what's probably the most fundamental difference between the two specs: In PuSH, updates from the hub are authoritative - they contain actual feed content that clients can use, bypassing the feed entirely. In rssCloud, on the other hand, notifications just tell clients to update the feed, so the notifications themselves don't have to be trusted. This greatly simplifies implementation of rssCloud, but it comes at a cost. Whenever a <cloud>-enabled RSS feed updates, it's going to get hit by every <cloud>-enabled client at the same time, when they're notified. This "flash crowd" is exactly the problem that Google was trying to solve by having the notifications contain authoritative feed content. It's a tradeoff, and it's hard to say which paradigm is better - static file hosting can be made to scale, after all. Luckily, both are in competition now, so time will tell.

So, time passes, and then a feed that you're subscribed to updates! Like I said, I'm not covering the publish side of the protocol, so let's assume that the hub finds out about the update via voodoo magic. In the rssCloud world, the hub goes and sends the URL of the feed that updated to every subscriber, via whatever method they requested. If a feed is known by multiple URLs, the hub deals with that, and sends each subscriber whichever URL they asked for. Subscribers can then update the feeds immediately. In PuSH, on the other hand, each feed has a canonical URL (specified by the <link rel="self"> tag, which is mandatory in PuSH), and it sends actual feed content in the notifications, not just feed URLs.

In the end, the differences between the two specs come down to allocating complexity. PuSH is a bit more complex, but it's in ways that will help it scale better in the long run.

Implementation

I did my implementation in Python, because everything I write for fun these days is in Python. My first shot at it was Tuesday night, and it started working in the wee hours of the morning, but the design was freaking terrible. >_> They say that you should throw away your first implementation of anything you write, and that certainly applied here. Wednesday afternoon I took another shot at it, got the design down, tweaked it so there'd be as little duplication of effort as possible when I implement PuSH later on, and got it working by Wednesday night.

My implementation uses XML-RPC, because gosh, it's just so easy to use in Python. I ended up having to add support for plain HTTP POST notifications too, because it seems like all the feeds available online use those exclusively, go figure. I probably could have gotten better results by using the excellent Twisted library, but I decided I'd rather not add any dependencies to my reader, so I just used the standard Python libraries instead.

Future directions

This is neat and all, but what are the consequences of this technology going to be? Dave Winer seems dead set on creating a completely decentralized Twitter clone using RSS, with real-time notifications using <cloud>. It'll be interesting if he manages it, but I really think that cloning Twitter is setting the bar too low. :p It's a natural goal, though - Twitter is more fast-paced than blogs, so adding real-time notifications to blogs could very well push in the direction of shorter, more frequent posts.

Connectivity is going to be a problem with this, going forward, since every client needs to be able to accept incoming connections, which hasn't been generally possible since NAT broke the Internet. There are a few possible solutions. The best, obviously, is to just roll out IPv6, so that every client on the Internet is reachable again. Another possibility would be to have RSS "proxies" that are reachable, and that communicate updates to clients by some other means. This could be clients connecting to the proxy and getting update notifications as bare URLs in a stream, or alternatively the proxy could just throw up a static RSS feed that clients poll more frequently. Both approaches have their pros and cons, but I personally favor the former.

In order to keep up with the accelerating pace of feeds, I predict that the more advanced RSS clients are going to include interfaces to popular blogging platforms, so that you can get your reactions to the news on the web more quickly (analogous to retweeting on Twitter). For this to really work nicely, we're going to need some way of tracking discussions that happen through blog posts. Automatically linking back to the post you're replying to is a trivial first step, and one that's already pretty well established by convention. Backlinks allow you to see which posts are replies to a given post, but they're not consistently used these days due to the potential for spam.

Real-time news is going to be fun. CNN is already on it; they're using Wordpress's rssCloud support to run a real-time enabled news feed. News being brought to your desktop within seconds of it being reported. This is the sort of thing the Internet is capable of giving us; it's certainly taken us a while to catch on.

Thursday, September 3, 2009

5 Things about ZFS that suck

So it's Friday now, and it turns out I completely forgot about writing a post this week. (I have two in the works, but both deserve more seriousness than I can find time for today.) Part of the reason for this is that I spent a good part of the weekend fixing my computer - in the course of getting KMS working (see previous two posts), I had to do several hard shutdowns when things didn't go right, and one of these killed my big media RAID. I keep a lot of files in that RAID, almost 700 GB, so this was kind of a big deal.

The kicker here, though, is that I was using ZFS for my RAID, precisely because it promises a hitherto-only-dreamed-of level of reliability. The idea behind ZFS is that you configure it across several hard drives, with as much or as little redundancy as you want, and it'll survive just about anything, including loss of one or more drives or random disk corruption, and keep on running through all that without a hiccup. It's an impressive filesystem, to be sure, but it's not without its flaws.

In fact, I got fed up enough with the flaws to make a list!

1) No native Linux support

Yes, I understand the licensing issues involved, and yes, I understand that the FUSE port of ZFS works pretty well on Linux, and yes, I understand that because of various design choices ZFS has almost no chance of making it into the Linux kernel. But you know what? All of these things are solvable problems, and I think it says a lot that they weren't solved a long time ago.

2) Inflexibility of disk mirrors

If you have two hard drives that are mirrors of each other, ZFS handles things well when you first set up the array. Instead of requiring that the disks be the same size, like some RAIDs, it'll just take the minimum of the two sizes, and pretend both the disks are that size. Then, if you take the smaller disk offline, the array will automatically grow to whatever size is available to it. Here's the problem, though: once ZFS is set up, you can't add a smaller disk to mirror a larger disk - not even if the data would fit on the smaller disk, not even if there's no data at all yet on the larger disk. This can lead to the somewhat ridiculous situation of detaching a disk from a mirror, and being unable to reattach it, because it's become too small.
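
Here's a hypothetical session showing the trap, with made-up device names (the error text is approximate):

$ zpool detach tank sdb
$ zpool attach tank sda sdb
cannot attach sdb to sda: device is too small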

3) Inflexibility of RAID-Z

RAID-Z is just RAID5, but in ZFS. It's the most flexible RAID option when you're setting up, but make sure you've got it configured how you want it, because once it's set up there's no way to add or remove disks in a RAID-Z. All you can do is replace them if they fail. Modifying a RAID-Z would apparently require a fairly significant change to ZFS called "block pointer rewriting". (More on that in a sec)
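
A sketch of what you can and can't do, again with made-up device names and approximate error text:

$ zpool create tank raidz sda sdb sdc    # three-disk RAID-Z
$ zpool replace tank sdb sdd             # swapping a disk out works fine
$ zpool remove tank sdc                  # shrinking the array does not
cannot remove sdc: only inactive hot spares or cache devices can be removed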

4) Inability to remove top-level vdevs

This is another one that can really bite you in the ass. Let's say you're trying to add a mirror to an existing drive, but you're careless and you use the "add" command instead of the "attach" command. No problem, right? No data's even hit the disk, so you should be able to just remove it, right? Wrong. Top-level vdevs are permanent in ZFS. Your only recourse if this happens is to back up all your data, destroy the pool, and then re-create it from scratch. (Guess who made this exact mistake this weekend. :( )
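
For the record, the two commands are dangerously easy to mix up (device names made up):

$ zpool attach tank sda sdb    # what you meant: mirror sdb onto sda
$ zpool add tank sdb           # what you typed: sdb is now a top-level vdev, permanently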

Apparently, removal of top-level vdevs is another thing that depends on "block pointer rewriting". Funny thing, though - even though it's necessary for features that people have been clamoring for since day 1, and despite repeated assurances from people at Sun that it's a super-high priority for them, there's been no public progress on this issue in years. (And by public progress I mean released code, or at least an official announcement. The most recent news I can find is a blog post from the end of last year. :/)

This leads into the next thing...

5) Opaque development

Maybe I'm just spoiled by Linux development, where you can literally watch changes as they go through, but I find it irritating when open source software development happens behind closed doors. Sure, I realize that Sun has legitimate business reasons for not releasing all their source code as they write it, but I think they've gone too far in that direction. It took me forever to even find their bug tracker page on vdev removal, for instance, and even that is almost completely bare of information.

Really, this is a larger problem with Sun's open source efforts, and one they've been criticized for before. They open up their source code, which is nice, but the barrier to entry for contributions is high enough that a community never really builds up, so all work ends up being done by Sun engineers in the end. This isn't usually a huge loss, since Sun has some pretty sharp engineers, but it means that their open-source efforts will never really live up to their potential.

Now, having said all this, I do have to admit that ZFS is a pretty slick piece of work - it's easily five years ahead of anything else out there in the end-user market. It's not perfect, though, and people need to remember that - for every skeptical page about ZFS on the Internet, there are dozens that are little more than cheerleading. And that doesn't help anybody.