
For those who haven't heard (all two of you): Netflix is challenging developers to improve their recommendation system. Details here: http://netflixprize.com/. Basically, astound everyone and you win one million US dollars - significantly improve things for the year and you might win fifty thousand.

This contest pitches to the sweet spot of all bot authors trying to improve their bots through statistical machine learning (those of you using guess factors or dynamic clustering, you've got no excuses). I like money as much as the next guy but the beautiful thing about this contest is its realistic dataset - you're expected to predict user ratings given 17,000+ movies and 100,000,000+ ratings of those movies. Yup, that's well over a million. Pretty nice for developers currently limited to a few thousand enemy scans.

The cynical might assume the contest will be dominated by PhD specialists but this article suggests there's hope for all: http://jsnell.iki.fi/blog/archive/2006-10-15-netflix-prize.html. I've downloaded and played with the dataset a bit. It's daunting - the first thing you've got to decide is how you're going to manage all that data. I've opted to store everything in a database since that's what I'm used to... can't tell if that's the right approach but it's a start...
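For anyone wondering what the database route might look like, here is a minimal sketch using Python's built-in sqlite3 module. It is not the setup described above, just one way to do it. The table layout, the `load_ratings` helper and the sample lines are all made up for illustration, and the assumed file layout (a `movie_id:` header line followed by `customer_id,rating,date` lines) should be checked against the dataset's own README.

```python
import sqlite3

def load_ratings(conn, movie_id, lines):
    """Insert 'customer_id,rating,date' lines for one movie (hypothetical helper)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.endswith(":"):
            continue  # skip blank lines and the assumed "movie_id:" header line
        customer_id, rating, rated_on = line.split(",")
        rows.append((movie_id, int(customer_id), int(rating), rated_on))
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)
    return len(rows)

# In-memory database for the demo; a real run would use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings ("
    "movie_id INTEGER, customer_id INTEGER, rating INTEGER, rated_on TEXT)"
)
# Index both lookup directions: all ratings of a movie, all ratings by a customer.
conn.execute("CREATE INDEX idx_movie ON ratings (movie_id)")
conn.execute("CREATE INDEX idx_customer ON ratings (customer_id)")

# Made-up sample lines, not real data from the contest.
sample = ["1:", "101,3,2005-09-06", "102,5,2005-05-13", "103,4,2005-10-19"]
inserted = load_ratings(conn, 1, sample)

movie_mean = conn.execute(
    "SELECT AVG(rating) FROM ratings WHERE movie_id = ?", (1,)
).fetchone()[0]
print(inserted, movie_mean)
```

The two indexes are the main design choice here: most prediction schemes need both "everyone who rated this movie" and "everything this customer rated", and without indexes each of those is a full scan over 100,000,000+ rows.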

There's a chunk of cash ear-marked for robowiki.net if I strike gold ;).

I'd be curious to hear stories if anyone plays with this dataset. -Corbos

That's kinda nutso, but totally possible. Basically, use the most simplistic dynamic clustering system (I'm assuming new data is added over time) to determine the closest set of current data, or even a more gun-like implementation that finds the closest matches over time to predict the next set to come. Easier said than done. If it's just a pot of data with no new data coming in over time, I think a simple mean would be the best way to find the next result. I prefer simplicity over complexity.
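To make the two ideas concrete, here is a rough sketch comparing a plain per-movie mean with a nearest-neighbour lookup in the spirit of dynamic clustering. Everything in it is an assumption for illustration: the toy ratings are made up, the distance is a mean squared difference over shared movies, and the function names are hypothetical.

```python
from collections import defaultdict

# Made-up toy ratings: (user, movie, stars)
RATINGS = [
    ("a", "m1", 5), ("a", "m2", 4), ("a", "m3", 1),
    ("b", "m1", 5), ("b", "m2", 5), ("b", "m3", 2), ("b", "m4", 5),
    ("c", "m1", 1), ("c", "m2", 2), ("c", "m3", 5), ("c", "m4", 1),
]

def by_user(ratings):
    """Group ratings into {user: {movie: stars}}."""
    table = defaultdict(dict)
    for user, movie, stars in ratings:
        table[user][movie] = stars
    return dict(table)

def movie_mean(ratings, movie):
    """Baseline: average of every rating the movie has received."""
    stars = [s for _, m, s in ratings if m == movie]
    return sum(stars) / len(stars)

def distance(r1, r2):
    """Mean squared difference over movies both users rated."""
    shared = set(r1) & set(r2)
    if not shared:
        return float("inf")
    return sum((r1[m] - r2[m]) ** 2 for m in shared) / len(shared)

def knn_predict(ratings, user, movie, k=1):
    """Average the ratings of the k most similar users who rated the movie."""
    table = by_user(ratings)
    mine = table.get(user, {})
    candidates = []
    for name, other in table.items():
        if name == user or movie not in other:
            continue
        d = distance(mine, other)
        if d != float("inf"):
            candidates.append((d, other[movie]))
    if not candidates:
        return movie_mean(ratings, movie)  # fall back to the simple mean
    nearest = sorted(candidates)[:k]
    return sum(stars for _, stars in nearest) / len(nearest)

baseline = movie_mean(RATINGS, "m4")
neighbour = knn_predict(RATINGS, "a", "m4", k=1)
print(baseline, neighbour)
```

On this toy data the mean says 3.0 for user "a" on "m4", while the nearest neighbour (user "b", who rates almost identically) says 5.0. Whether that kind of gain survives on the real dataset is exactly what the contest measures.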

Oh, perhaps throw in the data sorting capacity of a Folded Pattern Matcher, allowing quick data lookup. However, such a HUGE data set would require a form of redundant storage. Scary thought to have all that data active and in memory.

I would give this a shot if I thought I had a shot, but I just don't know enough about how to handle a data load that massive.

Oh, also set up a system that will analyze and place "outliers" in their own redundant dataset based on majority of bases. That should increase the score by a good bit, as you won't have random spices spoiling the stew. -- Chase-san
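One possible reading of the outlier idea, sketched under loud assumptions: treat a user as an outlier when their ratings sit far from each movie's average, and move all of their ratings into a separate set. The threshold, the `split_outliers` name and the toy data are all made up, and nothing here shows that this would actually improve a contest score.

```python
from collections import defaultdict
from statistics import mean

# Made-up toy ratings: (user, movie, stars)
TOY = [
    ("a", "m1", 5), ("a", "m2", 1),
    ("b", "m1", 5), ("b", "m2", 1),
    ("c", "m1", 4), ("c", "m2", 2),
    ("d", "m1", 1), ("d", "m2", 5),  # rates against the crowd
]

def split_outliers(ratings, threshold=1.5):
    """Split ratings into (main, side); side holds users far from consensus."""
    stars_by_movie = defaultdict(list)
    for _, movie, stars in ratings:
        stars_by_movie[movie].append(stars)
    movie_avg = {m: mean(s) for m, s in stars_by_movie.items()}

    # Mean absolute deviation of each user from the per-movie averages.
    deviations = defaultdict(list)
    for user, movie, stars in ratings:
        deviations[user].append(abs(stars - movie_avg[movie]))
    outliers = {u for u, d in deviations.items() if mean(d) > threshold}

    main = [r for r in ratings if r[0] not in outliers]
    side = [r for r in ratings if r[0] in outliers]
    return main, side

main, side = split_outliers(TOY)
print(len(main), sorted({r[0] for r in side}))
```

Note that the outlier's own ratings still pull the movie averages in this version; a more careful pass would recompute the averages after the split.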

I remember hearing about this once, but it had slipped my mind until you posted this. Now I'm really interested (I like money, too). I'll definitely take a look at this if I can find some time... -- Voidious

Last edited November 28, 2006 14:55 EST by Voidious