Sunday, January 9, 2011

Using "make" as a download manager (or: how to parallelize EVERYTHING)

I find myself in the somewhat tedious position of having to download about a hundred large files from S3. I only need to download certain files, so I can't just use S3's API to pull down the entire bucket. What a pain!

I did the first thirty or so by copy/pasting the URLs onto a wget command line, so that I could at least download them in batches. This worked, but was slower than necessary: S3 occasionally gives you a slow download, so every once in a while one file would take ten times longer than the rest, holding everything up. You know what'd really be great? If I could download with wget, but parallelize it, so that several files were downloading at once.

There are a few different options here. I could get a download manager, which would do what I want but would mean installing some random program that I'll never use again. I could use the solution presented in this blog post (first hit for "wget parallel downloads"), but that's still too much code. Or, I could use this nifty UNIX trick for parallelizing anything. :3

The code

First, write a Makefile that looks like this:
# Match-anything rule: $@ expands to whatever goal you asked for (here, a URL).
%::
        wget -nv -c $@

Save that as "Makefile", of course (and note that the indentation before wget has to be a real tab character, since make insists on tabs in recipes). Then, in the same directory, run this:
make -j3 [URL1] [URL2] [URL3] ...

...and all the URLs you specify will be downloaded, with three downloads going at a time.

But how does it work?

Not everybody knows this, but make includes a really neat dependency-aware parallel work queue. If you give it a really long list of jobs, where some of them depend on others, make can do them in parallel, while keeping track of what depends on what and making sure that dependencies happen before things that depend on them. (In this case, though, we have no dependencies to express, so the Makefile is really simple.) The "-j" option controls the number of jobs that make will run in parallel.
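Here's a toy example of that in action (the file names are made up): with "make -j4", the three .o files get compiled in parallel, but the link step waits until all of them are done.

prog: a.o b.o c.o
        cc -o prog a.o b.o c.o

%.o: %.c
        cc -c $<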

Our Makefile consists of a single wildcard rule (%::), matching any input, and calls wget with whatever we pass in ($@ - clearly whoever wrote make thought Perl-style variables were a good idea :p). make works by taking its command-line parameters, finding a rule that matches each one of them, and executing that rule. So all we have to do is run make with a bunch of URLs as parameters; it will look for a file called Makefile (the default), find a rule in it that matches the URLs (the wildcard rule, in this case), and execute that rule (downloading the file).
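You can watch that matching happen without downloading anything by asking make for a dry run (the URLs here are just placeholders):

make -n http://example.com/a.tar.gz http://example.com/b.tar.gz

...which prints the command it would have run for each URL:

wget -nv -c http://example.com/a.tar.gz
wget -nv -c http://example.com/b.tar.gz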

Variant: what if I have a list of URLs?

The coolest thing about the command line is the ease with which you can combine programs. In this case, let's combine the above hack with the "xargs" program, which translates streams of lines to command-line arguments. If you replace the command line above with this:
xargs make -j4 < list-of-urls.txt
...then you can download every URL in a list, in parallel. (As an aside, xargs knows about the limit on how many command-line arguments you can pass at once: if the list is too long, it simply runs make more than once, so this works with an arbitrarily long list of URLs.)
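For illustration, list-of-urls.txt is just a plain text file with one URL per line (these are placeholders):

http://example.com/file-001.dat
http://example.com/file-002.dat
http://example.com/file-003.dat

xargs reads those lines and tacks them onto the make command as goals, exactly as if you'd typed them yourself.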

But how do I parallelize EVERYTHING?

Up to you! This technique works in a surprisingly wide array of situations, if you're using the command line. The last time I used it, it was batch-converting a lot of images - imagemagick doesn't use multiple CPUs, but using make let me run multiple instances of it at once, so that I finished twice as fast. (My downloads have all finished, though, so the necessary changes to the Makefile are left as an exercise to the reader. :3)
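(If you'd like a head start on that exercise anyway, here's a minimal sketch, assuming you're converting every .jpg in the directory to a .png with ImageMagick's convert; swap in whatever formats and flags you actually need.)

PNGS := $(patsubst %.jpg,%.png,$(wildcard *.jpg))

all: $(PNGS)

%.png: %.jpg
        convert $< $@

Then a plain "make -j4" converts up to four images at a time.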

1 comment:

Ole Tange said...

GNU Parallel http://www.gnu.org/software/parallel/ is made for parallelizing jobs.

cat list | parallel -j4 wget

If you want to download a huge file in parallel, this construct may be helpful:

seq 0 99 | parallel -j4 -k curl -r {}0000000-{}9999999 http://example.com/the/big/file > file

Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ