Statistics Can Find You a Movie, Part 1
Netflix has over 100,000 DVDs to choose from, so if you know what movie or program you want, Netflix probably has it. But when you don’t know what to watch, 100,000 DVDs may be choice overload. To help its customers (and keep them subscribing), Netflix makes close to 2,000,000 personalized recommendations per day, accounting for 60% of its business.
These recommendations are made automatically by the Netflix-developed Cinematch recommender tool, which works on the principle that if two people like the same DVD, they’re likely to have other choices in common.
It’s a statistical approach, meaning past data is collected and then analyzed to derive rules for making predictions. Every time you rate or order a DVD, you’re creating a roadmap of your preferences and defining yourself according to a collection of likes and dislikes, inclinations or avoidances. Statisticians can look at the data collected about you and analyze it to make good predictions about what you will want to watch, buy, or do in the future.
There are different ways to analyze the data. One common approach, called nearest neighbors, groups together DVDs that are rated similarly by the same people. If a group of people gave Saving Private Ryan five stars, a nearest neighbors method would find other movies rated five stars by the same group of people, forming a neighborhood of movies similar to Saving Private Ryan; close neighbors of Saving Private Ryan would probably be other war movies.
Movies rated similarly by customers with the same viewing preferences are close to one another. If you liked one movie, you’ll probably like its neighbors.
Nearest neighbors is used by many recommender tools (Cinematch probably incorporated some version of it), and over the years, the method has been refined and modified.
Certainly Netflix had been tweaking their algorithms for years to improve performance. But at a certain point, further tweaking wasn’t getting better results, and it became increasingly important to find out whether other methods might work better.
It’s all about the data
So, in a smart, savvy move, the company decided to stage an open competition, inviting anyone to improve upon the accuracy of Cinematch by 10%, and offering a $1,000,000 prize to the first who could do it.
And they released their data.
In one big data give-away, Netflix released to the public 100,000,000 anonymous viewer recommendations for 18,000 DVDs (mostly movies, but also TV programming and events such as concerts), an unprecedented amount of real-world data, and by far the largest set of such data ever released. Statisticians, data miners, machine-learning experts, and others took note.
This data could validate theories, test algorithms and models, and say definitively what works and what doesn’t, a question of consuming interest to Internet retailers. Normally such data is hard to come by. All companies collect customer data but normally hoard it for the competitive advantage it represents. Plus there are privacy concerns; most people don’t want personal information such as their viewing preferences scrutinized. (Netflix removed all personal information before releasing the data.)
Statisticians can look at the data collected about you . . . and make good predictions about what you will want to watch.
So while the million-dollar purse generated visibility and public interest, it was the data itself that validated the competition as a serious scientific endeavor and justified the mammoth number of hours professional and academic computer scientists and data miners would collectively invest over the course of the competition.
Rules and scoring
Rules were simple. Beat the accuracy of Cinematch by 10% within five years.
The first one to do it would get $1,000,000 dollars. In addition to the grand prize, progress prizes would be awarded each year to the team with the biggest improvement.
Other rules dealt with ownership, specifying that contest participants retain ownership of their code, but that the winning team must license it (non-exclusively) to Netflix.
Barred from competing were citizens Cuba, Iran, Syria, North Korea, Myanmar, Sudan, and . . . Quebec. Quebec’s inclusion on this list was due to Netflix’s interpretation of the province’s anti-sweepstakes law. Although odd—the contest was anything but a game of chance— it didn’t seem at the time worth the effort to research the matter further. What were the chances that a team from Quebec would be among the winners?
Competing teams could download a training set of 100 million time-stamped recommendations (DVD ID, viewer ID, rating) as well as a probe set for testing. They also downloaded a qualifying set of 2,800,000 predictions whose ratings were withheld by Netflix.
The idea was for teams to use the training set to create a model for predicting DVD preferences, and then try it out first on the probe set. When ready, teams could test their model by making predictions for the 2,800,000 withheld ratings and upload the predictions to Netflix; Netflix would then compare the team’s prediction set with the actual ratings.
(A quick summary of machine-learning. Predictive models are created by training them on labeled data so they “learn” patterns, such as which DVDs are rated highly by people who like action movies. The patterns encoded by the model could then be applied to unlabeled data to predict what the labels should be. In the case of Netflix, the labels were customer ratings. The more data, the better the model and the better the predictions. The instructions that tell the model what patterns to look for, and the relative importance or weight to attach to each one, are called algorithms.)
So the teams could gauge their progress, results of their predictions were posted to a Leaderboard, which reflected the team standings. (Actually, the Leaderboard reflected only half the qualifying set. The other half was kept in reserve to determine the final winner. See sidebar for an explanation.) Teams could also learn a bit more about what worked and what didn’t, at least on half of the predictions. If submissions were unlimited, the teams could systematically go through each of the 2,800,000 predictions, changing one rating at a time to see whether their score moved up or down; for this reason, Netflix restricted each team to one submission per day.
Oversubmitting also carried the risk of overfitting—in which the model fits the quiz set so perfectly it becomes too specialized to predict a different data set. If the engineers tweaked their model according only to Leaderboard feedback, and not by learning more about the data patterns, their model would not generalize for the hidden test set.
The contest would test the limits of machine learning, data mining, and statistical data analysis, all areas that AT&T actively researches.
A team attaining a 10% improvement would trigger a 30-day final call for all teams to have a last chance to submit a better prediction set. At the end of the 30 days, the team with the best score against the hidden set would be declared the grand prize winner.
The scoring metric was root-mean squared error (RMSE). Cinematch’s RMSE on the test set was 0.9525, which translates on a scale of 1-5 stars to almost a full star. The goal was to reduce the RMSE by 10% to 0.8572 (1/10th of a star). While such an improvement doesn’t seem like much, across hundreds of millions of recommendations, it results in many more successful recommendations.
As the contest got underway, AT&T researcher and statistician Chris Volinsky began talking to his colleagues Yehuda Koren, a computer scientist, and Robert Bell, also a statistician, about participating.
The contest would test the limits of machine learning, data mining, and statistical data analysis, all areas that AT&T actively researches. Understanding data and the patterns in data touches on all aspects of AT&T operations and is especially important for fraud detection and other security issues. The size of the Netflix dataset was also big enough to evaluate which data mining techniques worked at very large scales; for AT&T, which operates some of the world’s largest networks, the problem of scale is a perennial issue.
AT&T’s interest was also practical since the company itself is developing automatic recommendation systems for its consumer and entertainment offerings, including U-verse.
The three researchers together had a good mix of expertise. Volinsky and Bell, coming from a traditional statistics background, were used to looking at data patterns to differentiate statistically significant trends from random noise; the datasets they worked with were usually small but complete. Yehuda Koren, coming from a computer science background, was used to working with large matrices, often with sparse data, to make predictions.
Of course, none of the three knew anything about recommender systems. They worried about being competitive.
The contest begins
The contest started on October 2, 2006 with the release of the data. The first postings occurred within a day. Within one week, Cinematch had been equaled, and within three weeks, bested, and by 3%. Probably the first postings were variations of nearest neighbors.
There was a lot of action on the Leaderboard the first few weeks as teams moved the RMSE lower, and teams moved up and down as they discovered new methods or tweaks.
More action was taking place on the Netflix forum, where teams and others were busy exchanging ideas, weighing in on what methods might work best, speculating whether 10% was even achievable, and commenting on the Leaderboard action. But the forum extended beyond simple commentary. Teams voluntarily divulged their methods, posting detailed descriptions. Other teams asked for help, and got it. Bits of working code were posted for all to use, contributing to a collegial, cooperative spirit that prevailed almost to the end of the competition.
Some big names were competing; among them, the machine-learning specialists from the University of Toronto; also entered was Gábor Takács, a well-known data miner from Hungary. Much Leaderboard speculation centered on what other big names might be behind some of the other teams.
As excitement built and the pull to compete became hard to resist, Volinsky, Bell, and Koren decided to go ahead and download the data. They figured they could spend a few hours here and there, developing their methods and learning from one another and from the free exchange of information on the forum. Volinsky suggested the team name BellKor, which combined two of their last names to allude to BellCore, the research facility for the original Seven Baby Bells. Bell and Koren would be responsible for most of the model building.
Within one week, Cinematch had been equaled, and within three weeks, bested, and by 3%.
BellKor’s first posting was on November 27 with an RMSE of 0.9209, putting them in 23rd place; by the next week they would be within the top 10. BellKor’s first models were built using principal component analysis, a statistical method that attempts to discover and reduce redundancy in data. Most other teams were probably working on variations of nearest neighbors, and may have continued along in this vein except for the sudden appearance of a new team, Simon Funk. Simon Funk jumped from nowhere to number 3 on the Leaderboard using a different method.
The method was singular value decomposition (SVD). Although not new, SVD had rarely been used before for recommender systems, probably because traditional SVD does not perform well with sparse data sets. And the Netflix dataset was sparse; although there were many ratings (100,000,000), most users had not rated most of the movies; in fact, 99% of the possible user-DVD ratings were missing.
What Simon Funk had done was adapt SVD to efficiently treat only the known values, rather than incorporating methodologies for estimating missing values, which was the usual thing to do for mathematical applications of SVD. All this became known because Simon Funk, in the spirit of the competition, almost immediately published his method to the Internet for all to see, almost inviting other teams to copy his method by titling his post Try this at Home.
Other teams did, and eventually BellKor would as well.
Beyond the neighborhood
SVD is one example of a family of data-mining techniques known as dimension reduction. By dimension is meant the intersection between each DVD and each customer. The Netflix data set had 18,000 (DVDs) x 480,000 (customers) for a total of 8.5 billion possible combinations.
For simplicity, consider a two-dimensional matrix of comedy and violence, where each DVD is represented by two numbers, one indicating the amount of comedy and the other the level of violence. A movie like Rush Hour would have large values for both, while No Country for Old Men would have a small value for comedy and a large one for violence. People likewise can be represented according to their preference for comedy and violence.
Using a common means to evaluate both DVDs and customers reduces the number of dimensions while capturing the essence of rating preferences. (For those who understand, SVD looks at a projection of data into lower dimensions such that the projection has maximal variance.)
Rather than a single matrix of 18,000 x 480,000 dimensions, using factors reduces the dimensionality to two smaller matrices: one of DVDs (18,000 x n), and one of customers (480,000 x n), where n is the number of factors. BellKor and other teams worked with 50-200 factors.
The programmers don’t have to figure out the factors; they just run the algorithm until the algorithm, by sorting the DVDs the specified number of times, defines the factors and assigns a rating for each factor. The result is a big file with rows of numbers, each row consisting of a DVD ID followed by a string of ratings, one for each factor.
The programmers don’t really care what the factors are, they just care about getting the right number of factors. The number should be large enough to distinguish why someone likes one DVD and not another. As the number of factors increases, more subtle distinctions are possible (action movies with car crashes but no blood), making for better customer-DVD matches.
At a certain point, increasing the number of factors brought less and less improvement until the higher computational time no longer justified further increases.
SVD by itself proved better at making predictions than nearest neighbors, though the reasons aren’t all that clear. Perhaps SVD, by taking a more holistic approach to the entire dataset and looking at all the ratings made by a customer simply had more information on which to operate than nearest neighbors, which looked only at the ratings that it deemed relevant. As the competition progressed, using more information almost always improved accuracy, even when it wasn’t immediately obvious why the information mattered or how little the information contributed.
Nearest neighbors looks only at how a customer rated DVDs within a single neighborhood
Nearest neighbors when constructing the neighborhoods of DVDs may also have depended too much on finding customers who were exact matches with other people, when in reality, maybe we are not all that similar, and it takes multiple dimensions to figure out the differences.
The sparsity of the dataset might also have limited the accuracy of nearest neighbors. With many customers rating no more than 20 DVDs (the mean was 48 DVDs), some DVDs may have had only a few neighbors or even none at all.
However, SVD had its own drawbacks; it was not as good as nearest neighbors at finding strong correlations among closely related films. Where it might seem obvious to recommend the second Lord of the Rings DVD to someone who liked the first (as nearest neighbors would have done), SVD didn’t always make this connection.
So BellKor and the other teams explored first one method and then another, trying to correct each method’s shortcomings as best as they could, systematically tweaking one parameter and then another until finally the only obvious course was just to use both methods.
And this they did; running SVD first to cover the entire dataset using multiple factors, and then running nearest neighbors to concentrate on a small, closely related group of DVDs.
But it was a version of nearest neighbors specifically tailored for the competition. Where traditional nearest neighbors gives more influence to close neighbors by assigning them heuristic weights without any direct relation to the target cost function, BellKor looked specifically for a weighting algorithm that worked to lower the RMSE, which was after all the whole point of the competition.
The algorithm BellKor developed adjusted the amount of weighting applied to each neighbor DVD depending on the customer. For example, in a neighborhood of animated movies, movies with a modern sensibility might be more weighted for those who would like
. . . finally the only obvious course was just to use both methods.
Nor was SVD the only factoring method used; other methods for deriving factors—collectively termed latent factor models—included principal component analysis and Restricted Boltzmann Machine (a neural network method version invented by Geoffrey Hinton, who was competing on the University of Toronto team). Each method finds different factors, adding to the information that can be exploited.
The output of each model could be treated separately or used as input for another model. Final outputs would then be combined, or blended, according to how much accuracy each was assumed to have.
To give an idea of how well blending worked: BellKor at the end of the first year, had a 6.6% improvement with a single-method SVD implementation; when 107 outputs were blended (which is what they had at the end of the first year), improvement increased to 8.4%.
First progress prize
By the time the teams had done the core model building and started developing techniques for combining models, the competition’s first year was coming to a close, with the first progress prize to be awarded to the team with the best improvement as of October 1.
BellKor had been in or near the lead since March and was comfortably in the lead the week before the first progress prize deadline. But competition was close throughout. Any team happening upon a new idea, method, or tweak before the other teams found it, could jump ahead at least until someone published their method for the benefit of all.
This Leaderboard snapshot shows BellKor seeming to have a good lead as of 5:00 pm September 30, the day before the deadline for the first progress prize.
An hour later, at 6:00 pm, something unexpected happened. The fifth- and sixth-place teams, Arek Paterek and basho, combined their predictions and posted a joint submission, moving past the second and third teams (Gravity and Dinosaur Planet) into second place, surprising everyone.
Later that same day, with twenty-five hours before the deadline, BellKor submitted a posting to achieve an 8.38% improvement, the highest to date. But 72 seconds before BellKor’s submission, the two veteran teams Gravity and Dinosaur Planet, who had just been passed by the combined basho/Arek Paterek, themselves combined forces (forming the team When Gravity and Dinosaurs Unite) to submit a posting with the exact same 8.38%, knocking BellKor into second place due to its slightly later posting time.
The idea of teams combining was indeed something new, and generated much discussion on the Leaderboard. Not everybody at this point agreed it was entirely fair play. But it was effective.
BellKor, taken a bit by surprise, had to make up ground. Team members worked through the night to find some way of improving their score, coming up with two different ways of combining outputs. Not knowing which worked better, they decided to submit both. BellKor could not submit twice on one day, but the rules allowed for new teams to form as long as they didn’t have the exact same members. So it seemed a good time for Chris Volinsky (who had never been an official member of BellKor) to join his colleagues Bob Bell and Yehuda Koren on a new team, KorBell. Within minutes of one another and with only 20 minutes or so remaining before the deadline, BellKor and KorBell submitted prediction sets, with KorBell having the better improvement of 8.43%. BellKor’s score, 8.41%, was enough for second place.
Notification of the win came within five minutes after the deadline via an email from Netflix; the Leaderboard updated shortly after.
Winning the progress prize was a bit of a double-edged sword; it was good to win but contest rules dictated that the winning team publish its methods. BellKor’s paper, thoroughly vetted by Netflix, was made public on November 13, and it can be imagined that the most avid readers were other teams. In any case, within a week, most teams improved upon their previous scores, and the playing field leveled a bit for the next year.
At the end of year 1, progress had been good, improving on Cinematch by an impressive 8.43%. But in the next year, progress would slow considerably.
The Netflix Prize
|BellKor Home Page|
The BellKor Solution to the Netflix Grand Prize
Required by competition rules in order to claim the Grand Prize.
Matrix Factorization Techniques for Recommender Systems
IEEE Computer 2009
The Million Dollar Programming Prize
IEEE Spectrum 2009
The BellKor 2008 Solution to the Netflix Prize
Required by competition rules in order to claim the 2nd Progress Prize.
Recent Progress in Collaborative Filtering
Lessons from the Netflix Prize Challenge
SIGKDD Explorations, Volume 9, Issue 2
The BellKor Solution to the Netflix Prize
Required by competition rules in order to claim the 1st Progress Prize.
Improved Neighborhood Based Collaborative Filtering
KDD 2007 Netflix Competition Workshop
Number of Netflix customers in training set:
Qualifying data set:
(divided 50-50 into quiz set and test set; quiz set results posted to Leaderboard)
Rating dates for training set:
10/98 to 12/05
1 - 5 stars
RMSE to win:
Contest start date:
Oct. 2, 2006
Number of teams submitting:
Number of submissions:
A measure of accuracy derived by comparing predicted values to actual values, and then finding the root square of the error.
This plot shows predicted values along a fitted line with actual values shown here as points.
The RMSE is the distance, on average, of a data point from the fitted line (representing predictions made by the model), measured along a vertical line.
To calculate RMSE, for each point take the distance vertically from the point to the corresponding y value on the line (the error), and find the square of the value. Add up all those values for all points, and divide by the number of points. Finally, report the square root of the result. Squaring prevents negative values from canceling positive ones
The 2,800,000 test set was divided by Netflix into two equal-size subsets: a quiz set and a final test set. Results against the quiz set would be posted immediately to a public Leaderboard, which listed the standings of the teams. The final test set was reserved to determine the winners of the progress and final prizes. Only Netflix knew which recommendations belonged to the quiz set and which to the test set.
Winning a progress prize or the grand prize depends on having the best score against the test set (not the quiz set).
It’s important to point out the Leaderboard reflected only the quiz set predictions, not the test predictions. Teams would not know how they scored against the final test until after the competition was over.
AT&T Research team. Winner of the 2007 progress prize (under name KorBell).
Yehuda Koren. A computer scientist in the information visualization group who at the time was applying linear algebra to network visualizations. Koren started the competition while at AT&T Research, then moved to Yahoo! in September 2008 while remaining on the BellKor team.
Robert Bell. A statistician specializing in machine learning methods, analysis of data from complex samples, and record linkage methods.
Chris Volinsky. A statistician with expertise applying statistical and data mining techniques to the real-world problems of massive data sets. Also Director of the Statistics Research Department.
Major methods: SVD, Restricted Boltzmann Machines, variations on nearest neighbors.
Individual. Made a dramatic jump to third spot by adapting SVD for the competition. His post Try this at Home detailed his methods for everybody to see.
Team led by Gábor Takács, a well-known data mining expert, and three PhD candidates from the University of Budapest. The team was consistently at the top of the Leaderboard and combined with Dinosaur Planet to take first place the day before the Progress Prize deadline, before being edged out for the prize only in the last few hours by KorBell.
Made up of Princeton undergraduates. The team was always in the top teams during the first year and combined with Gravity to almost win the First Progress Prize.
Highest-rated (#8) single-model (no blending). Was one of the first teams to combine with another team (Arek Paterek) to improve RMSE.