06 May 2007

I haven't seen much discussion of it on the web, so I thought I would post a link to this May 2007 paper from Google:

Google News Personalization: Scalable Online Collaborative Filtering

The abstract:

Several approaches to collaborative filtering have been studied but seldom have the studies been reported for large (several millions of users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptable for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.
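
To make the MinHash clustering idea from the abstract a bit more concrete, here is a minimal sketch in Python. It is not code from the paper. The function names, the use of seeded SHA-256 as a stand-in for random permutations, and the toy click histories are all my own assumptions. The underlying idea is that, under a random permutation of the items, two users share the same minimum item with probability equal to the Jaccard similarity of their click histories, so concatenating a few such minimums gives a cluster key.

```python
import hashlib


def item_hash(seed, item):
    """Hypothetical helper: a seeded hash that stands in for a random permutation."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=4):
    """Return one minimum hash value per seed for a user's set of clicked items."""
    items = list(items)
    if not items:
        raise ValueError("a user needs at least one clicked item")
    return tuple(
        min(item_hash(seed, item) for item in items)
        for seed in range(num_hashes)
    )


def cluster_users(click_histories, num_hashes=2):
    """Group users whose concatenated MinHash values match exactly.

    More hashes per key gives tighter (higher precision) clusters;
    fewer hashes gives larger, looser ones.
    """
    clusters = {}
    for user, items in click_histories.items():
        key = minhash_signature(items, num_hashes)
        clusters.setdefault(key, []).append(user)
    return clusters


if __name__ == "__main__":
    # Made-up click histories, for illustration only.
    histories = {
        "alice": {"story1", "story2", "story3"},
        "bob": {"story1", "story2", "story3"},
        "carol": {"story9"},
    }
    for key, members in cluster_users(histories).items():
        print(members)
```

Users with identical histories always land in the same cluster, while users with partially overlapping histories do so only some of the time, depending on how similar their histories are and how many hash values go into the key.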

They use the MovieLens dataset as one of their case studies, so there may be some applications to the Netflix Prize. The part I found most interesting was the first detailed description of using the MapReduce model to run large-scale Expectation Maximization (EM) computations in parallel. An implementation of this on Hadoop and Amazon EC2 would let you tackle some large-scale machine learning problems.
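
As a rough illustration of how an EM computation can be phrased in map/reduce terms, here is a small single-machine sketch in Python for a PLSI-style model, where p(s|u) = sum over z of p(s|z) p(z|u). This is my own simplification and not the paper's implementation: the function names, the key layout, and the random initialization are assumptions, and a real Hadoop job would shard the parameters and the click log across machines. The map step handles one (user, story) click at a time and emits fractional counts; the reduce step sums them by key; the M-step then just normalizes the sums.

```python
import math
import random
from collections import defaultdict


def init_params(clicks, num_topics, seed=0):
    """Random positive starting values for p(z|u) and p(s|z), each normalized."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in clicks})
    items = sorted({s for _, s in clicks})
    p_z_u = {}
    for u in users:
        w = [rng.random() + 0.1 for _ in range(num_topics)]
        p_z_u[u] = [x / sum(w) for x in w]
    p_s_z = []
    for _ in range(num_topics):
        w = {s: rng.random() + 0.1 for s in items}
        total = sum(w.values())
        p_s_z.append({s: x / total for s, x in w.items()})
    return p_z_u, p_s_z


def map_click(click, p_z_u, p_s_z):
    """Map (E-step): emit the posterior q(z | u, s) for one click as keyed counts."""
    u, s = click
    joint = [p_s_z[z][s] * p_z_u[u][z] for z in range(len(p_s_z))]
    total = sum(joint)
    for z, value in enumerate(joint):
        q = value / total
        yield ("user", u, z), q
        yield ("item", s, z), q


def reduce_sum(pairs):
    """Reduce: add up all values that share a key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals


def em_iteration(clicks, p_z_u, p_s_z):
    """One EM round: map over clicks, reduce, then normalize (M-step)."""
    num_topics = len(p_s_z)
    totals = reduce_sum(
        kv for click in clicks for kv in map_click(click, p_z_u, p_s_z)
    )
    new_p_z_u = {}
    for u in p_z_u:
        row = [totals[("user", u, z)] for z in range(num_topics)]
        new_p_z_u[u] = [x / sum(row) for x in row]
    new_p_s_z = []
    for z in range(num_topics):
        col = {s: totals[("item", s, z)] for s in p_s_z[z]}
        norm = sum(col.values())
        new_p_s_z.append({s: x / norm for s, x in col.items()})
    return new_p_z_u, new_p_s_z


def log_likelihood(clicks, p_z_u, p_s_z):
    """Sum of log p(s|u) over all observed clicks."""
    return sum(
        math.log(sum(p_s_z[z][s] * p_z_u[u][z] for z in range(len(p_s_z))))
        for u, s in clicks
    )
```

Calling `em_iteration` repeatedly on a list of `(user, story)` pairs should never decrease `log_likelihood`, which makes a handy sanity check. The reason this shape suits MapReduce is that each click's contribution is independent given the current parameters, so the map calls can run on separate machines and only the summed counts need to be brought together.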