The other week I got it in my head that it might be fun to mess around with Wikipedia. I don't just mean the website, though - I mean their entire database of articles. I've been looking around for side projects lately, and this is a perfect opportunity/excuse to write some code that operates on Big-with-a-capital-B data, which should be exciting.
My day job is working on Microsoft Office. Now, don't get me wrong, it's interesting work... but the thing about being a junior dev on Office is that I spend most of my time writing extremely robust solutions for extremely small problems. I want a side project so that I can maintain a sense of perspective about programming, so that I don't get stuck in a rut. So, it's time for me to do something completely ridiculous. Let's datamine Wikipedia! :D
Wikipedia posts dumps of their entire database online every once in a while. The latest one is from about two weeks ago, which is plenty fresh enough for my purposes. I'm using the pages-articles.xml.bz2 file, which is about 6.5 GB compressed, and 26 GB uncompressed. It contains just shy of nine and a half million articles. And it's a single XML file. Ouch. :(
There are probably text tools for Windows that are able to handle files that are larger than memory, but I'm not sure where they're hiding. So, I'm using Cygwin and UNIX tools for pretty much everything here; when in doubt, go old-school. "less" will get me a general sense of what the input data looks like pretty much instantly.
It looks like Wikipedia uses a consistent XML scheme for all their types of database dumps. I don't think it's documented anywhere, and the link in the XML to the schema is broken, but whatever. I can probably figure it out. >_> Anyway, it's not like their XML actually validates against any schema, as we'll see in a bit.
Splitting things up
A single XML file is not a terribly useful way to package up a database dump, but I'm sure they have their reasons. Before we can do anything with this data, we'll have to split it up into more manageable chunks. The structure I'm going for is one directory per article, with a single XML file in each directory. This makes it easy to find and read any article, and also leaves an easy way to add extra data to each article. It might be useful, for instance, to have the text of the article available without any extra formatting, or to have a list of links from the article, or things like that.
I wrote a python script to split the file into articles. It's available on the bitbucket page for this project. I decided not to use an XML parser to split up the file, and instead cobbled something together that just does substring searches to find article boundaries. This is generally a stupid thing to do, but streaming XML parsers are a pain, and this file is a hundred times bigger than I'd feel comfortable sticking in a DOM parser. (That said, I ended up using a real XML parser to get the article names out of the individual pages.)
Not using a real parser turned out to be a good move, because as it turns out, the database dump isn't actually valid XML all the way through. In the dump I happened to grab, there are six places where a page is truncated, including the closing tags. (Things may be better or worse in other dumps, but I really don't feel like grabbing another few-dozen-gigabyte dump file to find out.) If I were using a real XML parser, this is about where I'd give up, but with the script I wrote it was pretty straightforward to work around. We can figure out that we have a broken page when we see an unexpected opening page tag, dump what we have in the buffer to a special location to manually fix up later, and continue parsing from the next page.
Right now, the script is running at a bit over ten thousand articles per minute. I don't know if I can make that much faster, since it seems like the filesystem is the bottleneck at this point, but whatever. Once I've got a preprocessed copy of the database, the next steps will be to parse the articles and pull out a list of links, and to use that data to solve the "six degrees" problem. Stay tuned!