Sunday, January 30, 2011

Chrome, H.264, and 2015

Google ruffled a lot of feathers a few weeks ago when it announced that it would be dropping H.264 support from Chrome. In the short term, it seems like a ridiculous move - one guaranteed to kill HTML5 video, and keep Flash dominant for the foreseeable future - but come 2015, I think this move will make a lot of sense.

Some background on video codecs: A video codec is a method of compressing video, and the measure of a codec is how well it compresses - that is, what sort of tradeoff it forces between file size and image quality. Any codec can give you essentially perfect video if you give it enough space to work with, but H.264 is one of the best when it comes to delivering high-quality video at reasonable bitrates. Other prominent codecs for web streaming include Theora (open-source and royalty-free, but it doesn't compress as well as H.264), Dirac (an experimental codec developed by the BBC, which has yet to gain any traction), and VP8 (Google's codec, bought from On2 and now part of WebM). H.264 is dominant right now because it gives good results, is widely implemented, and has hardware-accelerated decoders available (crucial for mobile devices).

The HTML5 video tag doesn't specify any required video codecs, so each site is responsible for serving its video in a codec that all of its users can play. Last year there was a huge shootout between proponents of H.264 and Theora. H.264 is technically the better codec, but its patents are licensed through the MPEG-LA pool, and they intend to start charging royalties for web streaming in 2015. Right now, though, H.264 seems to be winning - it's backed by Microsoft and Apple, and it's actually the only way to get video to play on iOS devices.

This is a problem for Google: YouTube would be hit pretty hard by the license fees, since every video on YouTube is streamed in H.264. Google doesn't plan to take this lying down, though. If Theora isn't up to the job, then Google buys a codec that is competitive with H.264 - which is exactly what VP8 is - and releases it royalty-free for anybody to use. And if it looks like H.264 is still winning, Google is willing to drop H.264 from Chrome (just over 10% of the browser market) to try to kill its momentum and replace it with Google's own codec.

MPEG-LA was originally set to start charging royalties for H.264 web streaming in 2010, but they pushed the date back to 2015 in order to let H.264 become more solidly entrenched. People say that Google is being shortsighted by promoting their own codec, since they'll never make any headway while IE9 and Safari (and iPhones!) support H.264 by default. Google doesn't need to win, though - they just need a solid alternative to H.264 to exist by 2015, so that MPEG-LA is in a weaker position when it comes time to work out what YouTube has to pay for using H.264. That's their real game here, and so far it looks like they're going to pull it off.

Sunday, January 9, 2011

Using "make" as a download manager (or: how to parallelize EVERYTHING)

I find myself in the somewhat tedious position of having to download about a hundred large files from S3. I only need certain files, so I can't just use S3's APIs to pull down the entire bucket. What a pain!

The first thirty or so, I did by copy/pasting the URLs onto a command line with wget, so that I could at least download them in batches. This worked, but was slower than necessary: S3 occasionally gives you a slow download, so that every once in a while I'd get a file that took ten times longer than the rest, holding everything up. You know what'd really be great? If I could download with wget, but parallelize it, so that several files were downloading at once.
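
For reference, the batch approach is nothing fancy - it's just wget with several URLs on one command line:
wget [URL1] [URL2] [URL3] ...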

There are a few different options here. I could get a download manager, which would do what I want but would require me to install some random program that I'll never use again. I could use the solution presented in this blog post (first hit for "wget parallel downloads"), but that's still too much code. Or, I could use this nifty UNIX trick for parallelizing anything. :3

The code

First, write a Makefile that looks like this:
# Match any target name and download it; $@ is the target (here, a URL).
%::
        wget -nv -c $@

Save that as "Makefile", of course - and note that the wget line has to be indented with an actual tab character, not spaces, since make insists on it. Then, in the same directory, run this:
make -j3 [URL1] [URL2] [URL3] ...

...and all the URLs you specify will be downloaded, with three downloads going at a time.

But how does it work?

Not everybody knows this, but make includes a really neat dependency-aware parallel work queue. If you give it a long list of jobs, some of which depend on others, make can run them in parallel while keeping track of what depends on what, making sure each job's prerequisites finish before the job itself starts. (In this case we have no dependencies to express, so the Makefile is trivial.) The "-j" option controls the number of jobs make will run at once.
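
As a toy illustration of the dependency handling (nothing to do with the download trick - just to show off make's work queue), imagine a Makefile like this, where some-slow-command and another-slow-command stand in for whatever work you like:
# part1.txt and part2.txt can be built independently; final.txt needs both.
final.txt: part1.txt part2.txt
        cat part1.txt part2.txt > final.txt

part1.txt:
        some-slow-command > part1.txt

part2.txt:
        another-slow-command > part2.txt

Run "make -j2 final.txt" and the two parts get built in parallel, but the cat doesn't run until both of them exist.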

Our Makefile consists of a single wildcard rule (%::) that matches any target and calls wget on whatever we pass in ($@ expands to the name of the target - clearly whoever wrote make thought Perl-style variables were a good idea :p). make works by taking each of its command-line parameters as a target, finding a rule that matches it, and executing that rule. So all we have to do is run make with a bunch of URLs as parameters; it will look for a file called Makefile (the default), find the rule that matches them (our wildcard rule), and execute it once per URL - which downloads the files.
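
You can watch the matching happen without downloading anything by doing a dry run - "make -n" prints the commands it would run instead of running them (the URL here is just a placeholder):
make -n http://example.com/some-big-file.bin
...which prints the wget command that the wildcard rule expands to:
wget -nv -c http://example.com/some-big-file.bin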

Variant: what if I have a list of URLs?

The coolest thing about the command line is the ease with which you can combine programs. In this case, let's combine the above hack with the "xargs" program, which translates streams of lines to command-line arguments. If you replace the command line above with this:
xargs make -j4 < list-of-urls.txt
...then you can download every URL in the list, in parallel. (As an aside, xargs knows how to work around the system's limit on how long a command line can be, splitting the arguments across multiple invocations as needed, so this should work no matter how many URLs you have.)
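
And since I only need certain files, a filter drops right into the same pipeline - say the full list of what's in the bucket is in all-urls.txt (a made-up name) and the files I want share some substring:
grep 'some-pattern' all-urls.txt | xargs make -j4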

But how do I parallelize EVERYTHING?

Up to you! This technique works in a surprisingly wide array of situations, as long as you're working at the command line. The last time I used it was for batch-converting a lot of images - ImageMagick doesn't use multiple CPUs, but make let me run several instances of it at once, so I finished twice as fast. (My downloads have all finished, though, so the necessary changes to the Makefile are left as an exercise for the reader. :3)
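
If you want a starting point for that exercise anyway, here's a minimal sketch - assuming, just as an example, that the job is converting every JPEG in the current directory to PNG with ImageMagick's convert:
# One .png per .jpg in the current directory.
SOURCES := $(wildcard *.jpg)
TARGETS := $(SOURCES:.jpg=.png)

all: $(TARGETS)

# Pattern rule: $< is the source .jpg, $@ is the .png to produce.
%.png: %.jpg
        convert $< $@

Run it with "make -j4" (or however many CPUs you have) and make keeps that many conversions going until they're all done.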