Thursday, August 6, 2009

Data Mining Post x2

1. Netflix

The Netflix prize competition finished up recently! This is of special interest to me, because the competition was the final project in the data mining course that I took recently. Basically, Netflix has some software (Cinematch) that predicts what movies people will like, based on what movies they've liked in the past. The competition was to write a program which makes these predictions with 10% greater accuracy than Cinematch, with error measured as the RMSE. The prize is one million dollars (!!!), which is enough to spur some serious data-mining research into the subject.

The contest ended when one team (BellKor's Pragmatic Chaos) achieved the 10% goal. This started a month-long countdown, and there was a bit of back-and-forth and drama between BKPC and another team (The Ensemble), as they both tried to hold the #1 spot at the moment the competition ended. The Ensemble took the spot with a better result with only one day to go, but BKPC took it back with just 20 minutes to go. The Ensemble then submitted a better entry in the final few minutes of the competition, but it looks like that one wasn't accepted for some reason, because at this point it seems like BKPC has won.

A really interesting thing that emerged as the competition got to the later stages was the merging of teams. BKPC is a merger of BellKor in BigChaos and Pragmatic Theory. BellKor in BigChaos is a combination of BellKor and BigChaos, and team BellKor was itself a collaboration. On the other side, The Ensemble was a combination of the second- and third-place teams when BKPC broke 10%, and both those teams were also merged teams. If I were a lot less lazy, I'd put together a flowchart of which teams ended up where - it'd definitely be fun to look at.

(Hint for companies looking for algorithmic improvements: A huge pile of prize money is probably a cheaper and more reliable way of getting results than actually paying for that kind of research. Also it's a lot more fun for everybody involved!)

The actual data provided was split into two sets - one massive set (100 million ratings) to be used as input for the algorithm, and a much smaller set(a few tens of thousands? I forget) for actually testing the performance of the algorithm. Having separate training and test sets is pretty standard for this type of problem; if you use the same set for both you risk overspecialization, where the algorithm does well on data it's been fed but poorly on any other data.

One hundred million ratings is a pretty freaking huge number of ratings. We used smaller datasets (5M and 20M ratings) to actually test the algorithms, because running a test against the 100M rating dataset can take hours or days (or weeks, if your code is especially sloppy). What makes it worse is that the 100M ratings we were given are only a small portion of all the ratings that were possible. It's convenient to view the data as a table, with movies along one axis and users along another, but keep in mind that the 100M ratings would only fill about 1% of this table. Even treating this table as a sparse matrix in Matlab, it took several gigabytes of RAM just to keep in memory.

Anyway, I'll spare you all the details of the algorithm my group implemented for the project (iterative trace-norm minimization by singular value thresholding; I only have the vaguest grasp of why it works myself :( ), but we ended up with a roughly 1% improvement over Cinematch's accuracy. We could definitely have done better, I think, but we were severely crunched for time toward the end (and actually, quite a bit of the work was done in a frantic all-nighter the night before the project was due, since we got the due date wrong). Realistically speaking, if we'd had time to properly test and tune the algorithm, we might have been able to beat Cinematch by 2-4%.

2. The Broken Window

What are the odds? I write a blog post on data mining, and then on the flight back from Pittsburgh I pick up The Broken Window, by Jeffrey Deaver, and it's about... data mining. :o

(Quick plot summary: The killer has access to a megadatabase created by a data mining corporation, and uses it to frame other people for his crimes.)

I picked up the book mainly because the last book I read by Jeffrey Deaver (The Blue Nowhere) was about hackers, and somehow, he got everything basically right. An author that can write a sto'ry involving hacking, and do their homework so thoroughly that I am satisfied with the result, is an author that has earned my respect. The Broken Window is a good read, if not a stellar mystery novel, and factually accurate enough that I didn't throw the book down in disgust (looking at you, Dan Brown).

On the other hand, the book just felt late, but this isn't necessarily Deaver's fault. Put it this way: fifteen years ago, this would have been science fiction. Ten years ago, it would have been seen as prescient. Five years ago, it would have been a timely novel; today (and more importantly in 2008, when it was published) it feels just a bit dated. For example, one of the pivotal moments in the book involves the main characters, a group of top-notch detectives and police officers, finding out about RFID tags for the first time.

I get the feeling that this was supposed to be a mystery novel, but it was missing something. Mysteries are supposed to have an "aha!" moment, when the culprit is revealed, and all the pieces come together in your mind. Instead, for me, it was more like an "oh." moment - it made sense, but there weren't enough clues there to make it an "aha!". The solution was just too unexpected. Frankly, I think Deaver spent too much effort on misdirection, making you think it was the CEO the whole time. Sterling, the CEO, was an obsessive, secretive type, and therefore entirely too obvious to be the actual culprit. Didn't stop Deaver from hinting that he was for 2/3 of the book, though. :/

All in all, though, I'd give this book a 7/10. It had some flaws, but it was a fun and worthwhile read overall; better than anything that hack James Patterson ever wrote. :p (I picked up one of his books for a flight one time. Considering how often people recommend him to me, it sure was terrible. XD)


Kiriska said...

Hey! Way to spoiler stuff! I was completely interested in reading that book until you mentioned stuff about the CEO. :( Maybe I will read some other book by the guy then. :(

I have very little to say about the Netflix thing considering whatever you said about your group's algorithm was way over my head, but the mentioning of the various groups merging together and stuff as the competition went on reminds me of our Critter evolution. Thingy. Yup. :3

P. Static said...

It only seems like the CEO did it for a bit, after which it becomes way obvious that he's just trying to make you think that >_> so not a terrible spoiler XD

Man, that critter project was fun :D Wish they would assign us more competitive stuff like that D:

Æther said...

Hm? I seem to remember you chatting with your group on the phone for that project, though all I knew about it at the time is that you were mining data.

As for your summer reading, I'm rather glad you can't spoil what I'm reading. (Especially if I don't tell you what it is.) :)