Monday, July 4, 2011

Processing Wikipedia

The other week I got it in my head that it might be fun to mess around with Wikipedia. I don't just mean the website, though - I mean their entire database of articles. I've been looking around for side projects lately, and this is a perfect opportunity/excuse to write some code that operates on Big-with-a-capital-B data, which should be exciting.

My day job is working on Microsoft Office. Now, don't get me wrong, it's interesting work... but the thing about being a junior dev on Office is that I spend most of my time writing extremely robust solutions for extremely small problems. I want a side project so that I can maintain a sense of perspective about programming, so that I don't get stuck in a rut. So, it's time for me to do something completely ridiculous. Let's datamine Wikipedia! :D

Getting started

Wikipedia posts dumps of their entire database online every once in a while. The latest one is from about two weeks ago, which is plenty fresh enough for my purposes. I'm using the pages-articles.xml.bz2 file, which is about 6.5 GB compressed, and 26 GB uncompressed. It contains just shy of nine and a half million articles. And it's a single XML file. Ouch. :(


There are probably text tools for Windows that are able to handle files that are larger than memory, but I'm not sure where they're hiding. So, I'm using Cygwin and UNIX tools for pretty much everything here; when in doubt, go old-school. "less" will get me a general sense of what the input data looks like pretty much instantly.

It looks like Wikipedia uses a consistent XML schema for all their types of database dumps. I don't think it's documented anywhere, and the link in the XML to the schema is broken, but whatever. I can probably figure it out. >_> Anyway, it's not like their XML actually validates against any schema, as we'll see in a bit.

Splitting things up

A single XML file is not a terribly useful way to package up a database dump, but I'm sure they have their reasons. Before we can do anything with this data, we'll have to split it up into more manageable chunks. The structure I'm going for is one directory per article, with a single XML file in each directory. This makes it easy to find and read any article, and also leaves room to add extra data alongside each article later. It might be useful, for instance, to have the text of the article available without any of the wiki markup, or to have a list of links from the article, or things like that.

I wrote a python script to split the file into articles. It's available on the bitbucket page for this project. I decided not to use an XML parser to split up the file, and instead cobbled something together that just does substring searches to find article boundaries. This is generally a stupid thing to do, but streaming XML parsers are a pain, and this file is a hundred times bigger than I'd feel comfortable sticking in a DOM parser. (That said, I ended up using a real XML parser to get the article names out of the individual pages.)


Not using a real parser turned out to be a good move, because the database dump isn't actually valid XML all the way through. In the dump I happened to grab, there are six places where a page is truncated, including the closing tags. (Things may be better or worse in other dumps, but I really don't feel like grabbing another few-dozen-gigabyte dump file to find out.) If I were using a real XML parser, this is about where I'd give up, but with the script I wrote it was pretty straightforward to work around: when we see an unexpected opening page tag, we know the previous page was broken, so we dump whatever's in the buffer to a special location to manually fix up later, and continue parsing from the next page.
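
For the curious, here's a stripped-down sketch of the approach - not the actual script from the bitbucket page, just an illustration. The <page>/</page> substring matching and the truncated-page handling are the real ideas; the file names, the naive title-to-directory mapping, and the (lack of) error handling are placeholders.

import os
import xml.etree.ElementTree as ET

def save_page(lines, base_dir, name):
    page_dir = os.path.join(base_dir, name)
    os.makedirs(page_dir)
    with open(os.path.join(page_dir, 'page.xml'), 'w', encoding='utf-8') as f:
        f.write(''.join(lines))

def split_dump(dump_path, article_dir, broken_dir):
    buf = []
    broken = 0
    with open(dump_path, encoding='utf-8') as dump:
        for line in dump:
            if '<page>' in line and buf:
                # A new opening tag before we ever saw </page> means the
                # previous page was truncated; stash it for manual fixup.
                broken += 1
                save_page(buf, broken_dir, 'broken-%d' % broken)
                buf = []
            if buf or '<page>' in line:
                buf.append(line)
            if '</page>' in line and buf:
                # Only here do we use a real XML parser, and only to pull
                # the title back out so we can name the directory.
                title = ET.fromstring(''.join(buf)).findtext('title')
                save_page(buf, article_dir, title.replace('/', '_'))
                buf = []
    if buf:
        # The dump ended in the middle of a page; treat that as truncated too.
        save_page(buf, broken_dir, 'broken-%d' % (broken + 1))

split_dump('pages-articles.xml', 'articles', 'broken')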

Right now, the script is running at a bit over ten thousand articles per minute. I don't know if I can make that much faster, since it seems like the filesystem is the bottleneck at this point, but whatever. Once I've got a preprocessed copy of the database, the next steps will be to parse the articles and pull out a list of links, and to use that data to solve the "six degrees" problem. Stay tuned!

Tuesday, June 21, 2011

Thinking with Portals

Portal is a fun little puzzle game, wherein you can create little wormholes that connect different places and walk through them. But it's also so much more! The puzzles by themselves would qualify Portal as a pretty neat game, but it's the writing and the story and the general atmosphere that make it one of the best (and most-quoted) games of the last decade.

I'm a couple years late to be singing Portal's praises, though; it came out in 2007. So why am I writing about it now?

[Spoilers follow. If you haven't played Portal, GO DO THAT RIGHT NOW OMG. No spoilers for Portal 2, so if you haven't played that one yet you're still good to keep reading.]

Actually, I've been thinking a lot about just what made Portal so awesome ever since I played through it a few weeks ago. (In general, I think it's worthwhile to examine awesome things!) One thing that's really struck me in retrospect is that the whole game is permeated by this incredible level of cognitive dissonance. I don't think it was an accident, or some happy quirk of the way they ended up writing it. It happens on so many levels, and in so many ways, that I think they must have planned it that way, or at least made a conscious decision about it at some point.

Example: turrets. There are automatic gun turrets sprinkled through the later levels, and they are adorable. They have little singsong voices, and they say things like "Are you still there?" and "Could you come over here?", and when you 'kill' them by knocking them over, they sometimes say "I don't hate you..." in a dejected little voice. You can't help but fall in love with them, and feel guilty about taking them down, even though they will basically kill you on sight. Relevant.

Example: GLaDOS. You start the game taking orders from a friendly-if-slightly-glitchy computer, and she gradually develops over the course of the game into the lying, omnicidal GLaDOS that we all know and love. GLaDOS herself is really a fine example of cognitive dissonance, on top of the unease she induces as she becomes more and more unhinged. She guides you through the levels with friendly-if-unsettling comments, and even goes as far as trying to be nice during The Escape, but it becomes extremely clear after a certain point that she wants nothing more than to kill you and be done with it. (The light at the end of that tunnel? It's not a cake, I'll tell you that.)

"Remember when the platform was sliding into the fire pit and I said 'Goodbye' and you were like 'No way!' And then I was all 'We pretended we were going to murder you?' That was great."

Example: your Weighted Companion Cube. Here, have a cube; this one is special! It's got hearts on it. Got a puzzle that needs solving? Your cube is part of the solution. Yep, it's just you and your cube, alone against the world! What's that, you've finished this level? Time to euthanize your only friend, then! Here, use this conveniently-placed incinerator. D: (Honestly, though? The very fact that they manage to make you form an emotional bond with a box with hearts on the sides is just another example of how masterfully they are messing with your head.)

Example: Still Alive, by Jonathan Coulton. This song plays during the end credits of the game, and is generally awesome. It's later than I planned to stay up writing this, so rather than explain why, I'll just refer you to its lyrics. <_<

So back to the title of this post, Thinking with Portals. Usually it's used in the context of "Now you're thinking with portals!", meaning that you're learning to use some of the cleverer tricks you can do with portals to solve puzzles. But can we interpret it another way? What if you could actually use portals to navigate your headspace: jumping between disparate and occasionally contradictory ideas, connecting things that would never normally be connected, being mentally in two places at once? Sound familiar?

Something to ponder.

Thursday, February 10, 2011

I am done beating this horse. It is dead.

You know what? I can handle it when mainstream media outlets call Anonymous a "hacking group". Everybody knows they don't exactly have a ready supply of technical expertise on hand; they probably couldn't tell an ethernet crimper from a can opener. We expect that from them. It's even sort of adorable sometimes, like seeing an elderly person trying to use a mouse like a foot pedal. But the point is, they don't know any better.

But when Ars Fucking Technica writes a long investigative piece about how dangerous it is to mess with Anonymous, because Anon is a Legitimate Group of Serious Hackers, that's just shameful. They, of all people, should know better. Nothing about Anonymous is secret except their names, and everybody who cares to ask knows that they're a bunch of angry teenagers with computers. But that doesn't make for a very good story, so now we have the tragic tale of Aaron Barr, who came too close to the leaders of Anonymous and got burned.

You know what I see when I read this story? A slightly deranged man, who happens to head a private security firm, decides to pick a fight with a group that's known mostly for their skill at harassment. His boss thinks he's going off the rails, his employees think he's completely deranged, but he presses on. Eventually a hacker shows up, exploits a SQL injection to get access to his company's internal network, and then passes out his passwords to Anonymous, who goes and has a field day, since that's what they do. However, his story happens to involve the major players of the Wikileaks scandal, so people pick it up as a useful proxy for real news.

So fine. I give up. If Ars Technica has decreed that Anon is a Serious Hacking Group, then that's it. Hacking is dead, nobody remembers what it means anymore. Long live Anonymous. :(

Sunday, January 30, 2011

Chrome, H.264, and 2015

Google ruffled a lot of feathers a few weeks ago when it announced that it would be dropping H.264 support from Chrome. In the short term, it seems like a ridiculous move - one guaranteed to kill HTML5 video, and keep Flash dominant for the foreseeable future - but come 2015, I think this move will make a lot of sense.

Some background on video codecs: A video codec is a method of compressing video. The measure of a codec is how well it compresses, and what sort of tradeoff you have to make between file size and image quality. Any codec can give you perfect video if you give it enough space to work with, but H.264 is one of the best when it comes to giving you high quality video at reasonable bitrates. Other prominent codecs for web streaming include Theora (which is open-source and royalty-free, but doesn't compress as well as H.264), Dirac (an experimental codec developed by the BBC, which has yet to gain any traction), and VP8 (Google's codec, bought from On2, and part of WebM). H.264 is dominant right now because it gives good results, is widely implemented, and because hardware accelerated decoders are available (crucial for mobile devices).

The HTML5 video tag doesn't specify any required video codecs, so sites are responsible for using one that all their users are able to use. Last year there was a huge shootout between proponents of H.264 and Theora. H.264 is technically the better solution, but its patents are licensed through the MPEG-LA pool, which intends to start charging royalties for web streaming in 2015. Right now, though, H.264 seems to be winning - it's backed by Microsoft and Apple, and it's actually the only way to get video to play on iOS devices.

This is a problem for Google: YouTube would be hit pretty hard by the license fees, since every video on YouTube is streamed in H.264. Google doesn't plan to take this lying down, though. If Theora isn't up to the job, then Google buys a codec that is competitive with H.264 and releases it royalty-free for anybody to use - which is exactly what the On2 purchase and WebM are about. If it looks like H.264 is still winning, then Google is willing to drop H.264 from Chrome (just over 10% of the browser market) to try to kill its momentum, and replace it with Google's own codec.

MPEG-LA was originally set to start charging royalties for web streaming H.264 in 2010, but they pushed the date back to 2015 in order to allow H.264 to become more solidly entrenched. People say that Google is being shortsighted by promoting their own codec, since they'll never make any progress when IE9 and Safari (and iPhones!) support H.264 by default. Google doesn't need to win, though - they just need a solid alternative to H.264 to exist by 2015, so that MPEG-LA is in a weaker position when it comes time to work out what YouTube has to pay for using H.264. That's their real game here, and so far it seems like they're going to make it.

Sunday, January 9, 2011

Using "make" as a download manager (or: how to parallelize EVERYTHING)

I find myself in the somewhat tedious position of having to download about a hundred large files from S3. I only need certain files from the bucket, so I can't just use S3's APIs and pull down the whole thing. What a pain!

The first thirty or so, I did by copy/pasting the URLs onto a command line with wget, so that I could at least download them in batches. This worked, but was slower than necessary: S3 occasionally gives you a slow download, so that every once in a while I'd get a file that took ten times longer than the rest, holding everything up. You know what'd really be great? If I could download with wget, but parallelize it, so that several files were downloading at once.

There are a few different options here. I could get a download manager, which would do what I want but requires me to install some random program that I'll never use again. I could use the solution presented in this blog post (first hit for "wget parallel downloads"), but that's still too much code. Or, I could use this nifty UNIX trick for parallelizing anything. :3

The code

First, write a Makefile that looks like this:
%::
        wget -nv -c $@

Save that as "Makefile", of course. Then, in the same directory, run this:
make -j3 [URL1] [URL2] [URL3] ...

...and all the URLs you specify will be downloaded, with three downloads going at a time.

But how does it work?

Not everybody knows this, but make includes a really neat dependency-aware parallel work queue. If you give it a long list of jobs, some of which depend on others, make can run them in parallel while making sure each job's dependencies finish before the job itself starts. (In this case, though, we have no dependencies to express, so the Makefile is really simple.) The "-j" option controls the number of jobs that make will run in parallel.

Our Makefile consists of a single wildcard rule (%::) that matches any target and calls wget with whatever we pass in ($@ - clearly whoever wrote make thought Perl-style variables were a good idea :p). make works by taking its command-line parameters, finding a rule that matches each one, and executing that rule. So all we have to do is run make with a bunch of URLs as parameters; it will look for a file called Makefile (the default), find a rule in it that matches the URLs (the wildcard rule, in this case), and execute that rule (downloading the file).

Variant: what if I have a list of URLs

The coolest thing about the command line is the ease with which you can combine programs. In this case, let's combine the above hack with the "xargs" program, which translates streams of lines to command-line arguments. If you replace the command line above with this:
xargs make -j4 < list-of-urls.txt
...then you can download every URL in a list, in parallel. (As an aside, xargs has all the necessary logic to work around the maximum number of command line arguments you can use at a time, so this should work with an unlimited number of URLs.)

But how do I parallelize EVERYTHING?

Up to you! This technique works in a surprisingly wide array of situations, if you're using the command line. The last time I used it, it was batch-converting a lot of images - imagemagick doesn't use multiple CPUs, but using make let me run multiple instances of it at once, so that I finished twice as fast. (My downloads have all finished, though, so the necessary changes to the Makefile are left as an exercise to the reader. :3)

Sunday, December 5, 2010

Adventures in UNIX: Pipe buffer edge cases, half-open sockets

So yesterday I had a really neat idea: exposing pipe-based programs as network services. You could open a connection to a program, send it data, and the remote computer would put it through some predefined command and send it back. Then I realized that with IPv6, you could give each program its own IP address, and give them all names in the DNS, so that you'd have a perfectly usable system without using any higher-level protocols than DNS and TCP. Then I realized that this would be simple enough that you wouldn't even need to write any code to make this work - it's all doable with simple shell commands! (I was wrong about this last one, and that's the topic of this post.)

(By way of comparison: This is sort of an inversion of the Plan9 model of network computation. Instead of mounting a remote filesystem and piping it through a local program, you're piping local data through a remote program.)

Pipes are usually a one-way structure, but for this to work properly, I needed something a little more exotic. I needed to be able to take output from a command and pipe it back around to the beginning of the pipe, so that the pipe has a loop in it. If I could do that, I could combine netcat with any command, and that'd be a one-liner that implements a server. :D

So here's the first thing I tried.

Server:
mkfifo t
while true; do nc -l 127.0.0.1 9999 < t | tr a-z A-Z >> t; done
Client:
nc 127.0.0.1 9999 < testdata

In a perfect world, this would work! But here's where we get into the details of pipe buffers.

The first problem with this is in the server. When you use a pipe, data's actually buffered along the way, in the commands that are being piped. Normally, this is transparent, because the buffers are flushed out when the previous command exits. This doesn't work when you have a loop through a fifo, though! The data that's buffered in the tr command doesn't get flushed to the fifo until netcat exits, and so netcat never actually has the chance to send the tail end of the data. The only way I could think of to solve this was to write some code for the server - pretty disappointing, but probably necessary. (I'm not going to post that code here, because it's even messier than being a prototype should justify. >_>)
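
That said, just to show the shape of the thing, here's a minimal sketch of what a cleaned-up server could look like. This isn't the prototype; the command, port, and buffering strategy are all just placeholders. It buffers the whole request in memory instead of streaming, runs it through the same 'tr a-z A-Z' command as the netcat version, and assumes the client half-closes the connection when it's done sending - which, as we're about to see, is exactly what netcat can't do.

import socket
import subprocess

# The command being exposed; 'tr a-z A-Z' matches the netcat experiment above.
COMMAND = ['tr', 'a-z', 'A-Z']

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(('127.0.0.1', 9999))
listener.listen(1)

while True:
    conn, _ = listener.accept()

    # Read until the client half-closes its side of the connection;
    # recv() returning nothing is how we see that EOF.
    chunks = []
    data = conn.recv(4096)
    while data:
        chunks.append(data)
        data = conn.recv(4096)

    # Run the command over everything we received. communicate() closes the
    # command's stdin, which is the flush-the-buffers step that the
    # netcat-and-fifo version never gets to do.
    proc = subprocess.Popen(COMMAND, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    output, _ = proc.communicate(b''.join(chunks))

    conn.sendall(output)
    conn.close()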

But that's not all - it turns out the client part is broken, for a completely different reason. When you give netcat an EOF (Ctrl-D), it doesn't know how to tell the remote side of the connection that there was an EOF. The server then doesn't have any way to know when to flush the buffers out and end the command, so the whole thing deadlocks waiting for more input that's never coming.

It turns out that TCP solves this problem; the bug is in netcat. With a TCP socket, you can close one direction of traffic, but keep using the other - for example, when you're done writing data to a socket, you can shut down the socket for writes, which signals to the remote side that you're done writing, and then read whatever the server sends back. This, unfortunately, required more code.

netpipe.py:
import select
import socket
import sys

addr = sys.argv[1]
BUF_SIZE = 4096

def attempt_read(s, BUF_SIZE):
    # Non-blocking read: only call recv() if the socket has data waiting.
    if select.select([s], [], [], 0)[0]:
        return s.recv(BUF_SIZE)
    return ''

s = socket.create_connection((addr, 9999))

# Send stdin to the server, echoing back anything it sends along the way.
buf = sys.stdin.read(BUF_SIZE)
while buf:
    s.sendall(buf)

    sys.stdout.write(attempt_read(s, BUF_SIZE))
    sys.stdout.flush()

    buf = sys.stdin.read(BUF_SIZE)

# The part netcat can't do: half-close the connection, so the server sees
# our EOF but we can still read whatever it has left to send.
s.shutdown(socket.SHUT_WR)

buf = s.recv(BUF_SIZE)
while buf:
    sys.stdout.write(buf)
    sys.stdout.flush()
    buf = s.recv(BUF_SIZE)

(This is trivial enough that I'm planning to port it to C soon.)

Finally, some good news: this works perfectly! :D With this, you can open a connection and use it as a component in a pipe.

Wednesday, December 1, 2010

p2p DNS

Now that the US is considering forcing pirate domain names out of the DNS, one of the founders of The Pirate Bay is floating the idea of a p2p DNS alternative.

Okay, wow. This is an incredibly terrible idea.

I'll start with the obvious objections:

  • The DNS is meant to be authoritative. In a p2p system, you don't know who you can trust, because everybody else is just a peer. The DNS is completely useless if the results you get back aren't authoritative. Some people are proposing web-of-trust type solutions, or other idiocy. NO. Web-of-trust doesn't scale, and requires too much human maintenance to ever work. Even being able to compute some kind of transitive trust metric is an open research question, and then there's the so-far-intractable problem of picking a trust metric. Any answer you get from a p2p DNS system will be unreliable.
  • The DNS is meant to be reliable. DNS is meant to be a transparent layer when you're using the Internet. It's something that you just sort of expect to work, and bad stuff happens when it doesn't. And the thing about p2p systems is, it's actually pretty near impossible to make any guarantees at all about their behavior. I've actually read a lot of papers about building distributed storage systems. And you know what? Nobody's ever actually managed to get anything better than a relatively weak statistical guarantee about any property of a p2p storage system. For the DNS, that's simply not good enough.
  • Performance. The DNS has pretty tight performance constraints, and p2p systems (for all their advantages) are extremely vulnerable to DoS attacks. It's pretty much inherent in their design - any p2p system will require a peer to have fairly complex communications with a lot of other untrusted peers. And, as many people have shown over the years, when you manage to take down the DNS with a (D)DoS attack, people tend to flip out.
  • Secure decentralized systems are HARD. Look, it's not like it's impossible for random people on the Internet to band together and write a program. It's not even that difficult; open source has proven that. What is hard is getting random people together to solve a fundamentally hard problem in computer science. Let me put it this way. If a well-respected professor of computer science were to propose a p2p DNS system, I would treat it with heavy skepticism. If Peter Sunde proposes it, and expects the Internet hivemind to just sort of blast through all the hard problems by sheer virtue of wanting torrents, then I just laugh. (And then, if it looks like people are taking him seriously, I write a blog post like this.)

There are some people whose first reaction to any data management problem is to try to stick it in a magic DHT and forget about it. In many cases it works - see BitTorrent, for example. A DHT will work in any application where you don't especially need data to be reliable or trustworthy; it's a perfect fit for BitTorrent peer exchange, where reliability is optional because the DHT is only a backup for the real tracker, and trustworthiness doesn't matter because the peers aren't trusted in the first place. For the DNS, though, a DHT is exactly the wrong solution.

It may be possible, someday, to fully decentralize the DNS. To do it will take some fundamental advances in computer science, though, and Peter Sunde isn't going to be able to make that happen by rallying the pirates to his cause.