Skills, Reputation, and Search

September 17, 2013

Video is now available of my keynote at Lucene Solr Revolution 2013 in San Diego on Skills, Reputation, and Search.


Slides & Thoughts from Hadoop World NYC

October 5, 2009

High level languages for MapReduce

Big data hackers, Apache Hadoop developers, and early adopters from several industries descended on the Roosevelt Hotel this weekend for Hadoop World NYC. I gave a talk on rapid prototyping of data intensive web applications using Hadoop, Hive, Python, and Ruby on Rails. The talk also had a few bits about using R with Hadoop for statistical computing at scale. The sessions were taped, and you can check out the video of my talk here: "Building Data Intensive Apps with Hadoop and EC2".

The slides give a high level overview of how I built the open source trend tracking site over a few weeks last June using Amazon EC2 and Cloudera tools. The code for the site is on Github and the raw data it is powered by is available on Amazon Public Data Sets. I've also posted a series of tutorials related to trendingtopics on the Cloudera blog over the past few months:

Here are a few resources mentioned in the talk:

Conference Highlights

It felt like around half of the attendees of Hadoop World were developers or data hackers I know of via Twitter or the Hadoop mailing lists. This resulted in some decent Twitter coverage via the hadoopworld hash tag. The other half of attendees represented enterprise IT, media companies, government, and financial firms who are either early adopters or interested in using Hadoop.

Some interesting announcements were made in the morning. Amazon added new features for the Elastic MapReduce service, including support for Hive, Cloudera's Hadoop Distribution, and integration with Karmasphere Studio.

Cloudera's big news was the launch of Cloudera Desktop, a new web-based unified user interface for users and operators of Hadoop clusters. Note that you can also run the Cloudera Desktop on Amazon EC2. Cloudera announced support for their distribution on Softlayer and Rackspace. They also outlined new features in the latest Hadoop distribution (CDH2), which includes support for HBase and Hadoop 0.20.1.

Vertica announced a partnership with Cloudera, which is an interesting development considering the RDBMS vs. MapReduce debates that took place last year.

I think I actually spent more time talking data with fellow hackers like Joshua Reich and Hillary Mason than I did in the talks, but still managed to catch some good ones by the EHarmony team, Stuart Sierra, Deepak Singh, and several Yahoo people. As a big Python user, it was exciting to hear that Jake Hofman from Yahoo! Research, NY plans on an open source release of a Python based Social Network Library for Hadoop, which he used to generate the Twitter analysis in his talk. A big theme in my talk and others I attended was the use of high level languages on top of Hadoop to accelerate development. Most of the teams I talked to actively use multiple abstractions on top of Hadoop, including Pig, Hive, Clojure, or other languages like Python through Hadoop Streaming.

For further details check out these notes from other attendees:

How FlightCaster Squeezes Predictions from Flight Data

August 24, 2009

FlightCaster During the last several years, an increasing number of systems within government and industry have been collecting massive amounts of raw data which often sits untapped in large data warehouses. FlightCaster strikes me as a great example of the next generation of web applications that will leverage that data: bootstrapped startups that apply machine learning and data processing at scale to solve a focused problem people actually care about.

From the site:

"FlightCaster predicts flight delays. We use an advanced algorithm that scours data on every domestic flight for the past 10-years and matches it to real-time conditions. We help you evaluate alternative options and help connect you to the right person to make the change."

FlightCaster uses data from:

  • Bureau of Transportation Statistics
  • FAA Air Traffic Control System Command Center
  • FlightStats
  • National Weather Service

I've been following the data crunching exploits of Bradford Cross on Twitter, and the launch of FlightCaster seemed like a great opportunity for an "in the trenches" interview on building a machine learning application with Rails & Hadoop. During the interview on FlightCaster, Brad describes some of the challenges of working with flight data, statistical approaches for flight prediction, false negatives in FlightCaster, Clojure, Hadoop & Amazon EC2, YCombinator, and lots more.

Of the 9? people at Flightcaster, how did the roles break down? How did you get started on the problem?

People at YC make fun of us a lot on account of our monolithic team.

We have a core team of 5 founders which breaks down as follows; a CEO, an air travel domain expert, two engineers working on web service, apps, and production issues, and me for research.

I have a secret agent collaborator working with me who has been very helpful with research and scalable compute infrastructure.

We also have a few other engineers doing a mix of stuff. This is a speed thing. We wanted to launch an iPhone app, blackberry app, and website all at the same time. We built it all very fast.

We got started on the problem thanks to Evan Konwiser and his peculiar and endearing obsession with the commercial air travel industry.

Did you have someone on the team with domain expertise who was familiar with the flight or weather data sources?

Absolutely, that is Evan Konwiser. He has been instrumental, and not just for familiarity with the data.

The airline industry is a very idiosyncratic industry. It would be hard to learn a lot of the subtle domain logic via induction alone. Our machine learning approach uses a mix of analytical and inductive learning.

What was the biggest challenge in working with the public flight and weather data? Is there a dataset that isn't out there right now that might make predicting flight delays easier for you?

The public data set that we use is the "on-time database" published by the FAA. The data set is tricky to get all in one place since the FAA does not provide any decent API to it. The biggest issue is that we make real time predictions, so we needed a historical set of captured real time data, which we had to create ourselves.

Having a more amalgamated real time dataset going back historically for a decade would be a big help. Having more modernized ways of accessing the data would be helpful.

Until then, if anyone wants to buy it, we will sell it to them for a very high price ... and to sweeten, they must throw in a few obscure, expensive machine learning books that I have on my Amazon wish list.

I've been a big fan of using the combination of Rails, Hadoop, and Amazon EC2 along with a high level language (in your case Clojure). Any tips for people out there thinking of using a similar technology stack? How cost effective is running Hadoop on EC2 for you?

Building layer upon layer of abstraction is a big key. On the jvm, you have to do this, it is the path around the verbosity of Java and the vast abyss of poorly done APIs. You just keep searching until you finally find the folks who have built a sane, high level API on top of the thing you want to use - then you wrap it in a high level language like Clojure. The technical term for this is "wrap the crap."

In our case, we use Cascading as our step up in abstraction on top of Hadoop.

S3 -> EC2 -> Cloudera -> HDFS -> Hadoop -> Cascading -> Clojure. I'm not sure if those layers are exactly the right order, but you get the point. The key is go keep layering until you encapsulate the plumbing and get to the level of abstraction that lets you focus on solving your problem.

Running Hadoop on EC2 has been very cost effective. The biggest issues have come into play with the disconnect between Hadoop and S3. S3 expects open connections to keep reading, and if they don't, S3 terminates them. S3 is very much the Arnold of the distributed file system world. So if your Hadoop jobs are compute intensive, and they are buffering in data in a lazy loading fashion, they tend to lose the connection to S3 during long processing phases. We've worked around this with some hackery, and we are working with Chris Wensel (of Cascading fame) on a more industrial strength solution to the problem.

There are many existing machine learning and statistical computing packages for R, Python, Java, C++, why did you choose to go with Clojure? What were the pros/cons of that approach? Do you use any of those other tools for prototyping or visualization?

Over the years, I have found that, in practice, the statistical and machine learning code is not the big thing to worry about. That code is fun to write, and often you want to tweak and extend the libraries in ways that they have not been designed for. So we use libraries and frameworks for the basics where we can, but we are OK to implement statistical and machine learning algorithms ourselves. I'm quite experienced with this anyway; all the way down to efficient custom data structures built on arrays.

The bigger problem to worry about is in the title of your site; the data wrangling. Especially pre-processing (filtering, transforming, etc) and the general fault-tolerant distributed compute infrastructure. I worked on this sort of thing during my time at Google, and it is far more complex than it seems. It is easy to grasp the concepts, and get an initial implementation, but the edge cases and last mile issues with respect to the fault tolerance are where you take the hit.

Hadoop is a wonderful solution for distributed computation, and since our code is all purely functional, it is very natural for us.

Clojure is ideal for both the data wrangling, as well as the statistical and machine learning code. Also, Clojure plays nice with everything on the jvm so we wrap and use lots of libs from the java world. Put this together with the distributed compute infrastructure that you get with Hadoop, and it starts to make a lot more sense to build these systems in Clojure on the jvm and use Hadoop than it does to use R, python, etc.

That said, if you just need to do quick and dirty prototypes, or don't have the need or the option to invest in infrastructure for distributed computing, R and SciPy are probably still the place to turn.

At Google, the research scientists prototype in python and R, and then port to C++ for the real scalable map reduce runs. We prototype and run in production on the same language and platform, and although it is not as fast as Google's C++ infrastructure, we do have the benefit that Clojure is very high level. For large runs, we parallelize with Hadoop, and we just run smaller tasks locally as if our Hadoop infrastructure were not even there. We are not coupled to it, and yet we don't need to port code or do anything special to run it in distributed mode.

Can you talk more about how you are using Hadoop for Flightcaster? Are most of the jobs I/O intensive, preprocessing a large volume of historical input data, or are they more often cross-validation or simulation runs? Do you tweak your live models then re-train against all the historical data with Hadoop?

Our jobs are CPU intensive - we do a lot of computation per unit of data, even in our data transformation jobs.

We train and test all our models offline, on captured real time data.

Eventually we would like to move to a more incremental online learning approach, but for now we re-train and re-test against newly captured data and re-deploy new the newly trained model.

Any tips for handling bad data/records in large MapReduce jobs?

We've been talking to Chris Wensel and cascading has some really cool stuff for this in the form of "filters" that we are not leveraging yet. We do a lot of filtering and scrubbing of our own during the preprocessing phase, but we will be looking at what cascading can do for us here very soon.

The biggest lesson we have learned about bad data is that we should spend more time up front with visualizing the data and making assertions (or filtering via predicates) to verify that our assumptions about the data hold.

I have had these issues in the past with financial data, but not as bad. Can a flight land before it takes off? Are there 68 hours in a day? These are examples of data that is not malformed or missing, but that violates fundamental properties that you expect to hold true, so they can be trickier to catch.

The big problem with not spending more time on this up front is that these issues are expensive to catch downstream because you will be looking through non-intuitive results of complex analysis jobs and it takes a long time to track down the root of such issues. It is similar to the argument for having good unit tests; it is cheaper to find issues at that level than at the system test level. It is also more sanity-preserving.

You probably can't say much about your secret sauce or algorithms, but can you discuss some general issues you encounter handling conflicting information sources or incomplete data? Do ensemble approaches or online learning play a role?

As I mentioned in a previous response, we want to get into more of an incremental online learning approach, and I think we will be able to soon. Of course this means certain approaches are in and others are out, but that may be OK in our case.

As of right now, we are using a combination of analytical and inductive learning. We compose individual classifiers that are themselves composed of rich domain logic. This is not an ensemble approach, though we may head in that direction very soon.

What we are doing now is more of a network-of-classifiers approach. It is a bit of a strange beast right now, so we would be remiss to call it a Bayesian network. Maybe we could call it a Cardono network in honor of Jerolomo Cardano. It is a bit of an eccentric network that is ahead of its time but waiting for some formalism to come along and straighten it out.

To a large extent, our current approach is an artifact of how quickly we have built our initial model. We have these rich individual domain specific classifiers that are targeted at different features from different data sources, and the approach we are using to learn the composition of these individual classifiers is an area of active research.

What is your general style for attacking prediction problems? Do you start with a single data source and get the entire pipeline running with a simple model, or you dive in building a more complex model with multiple data sources? How did Flightcaster compare to financial prediction?

I like to let simpler cases drive out a lot of the demand for infrastructure early on. As you suggest, I tend to start by getting something working end-to-end with a simple model and a single data source. Then you can start to evolve faster, join in other data sources, and try more deeply theoretical ideas; all on top of a quality infrastructure.

That said, some infrastructure is not required until you get to more complex models, so it is always an evolutionary process where model drives infrastructure.

I made a lot of mistakes early in my career in building trading models where I let me theories get too far ahead of what I could really test in practice. That is not a good place to be. Unfortunately, this is an easy mistake to make.

Flightcaster has been very different from my work in investing in a number of ways. FlightCaster makes predictions by turning the probability distribution estimation problem into a k-way classification problem.

This is a simpler problem than designing holistic investment strategies.

In investing, you have to answer many questions. Selecting the point at which buying and selling occurs is akin to predictive problems in other domains. However, this only answers the question of when to buy or sell.

You also have to decide what to buy or sell - which is called portfolio selection. You also have to decide how much to buy or sell - which is called portfolio allocation. There are also the topics that I like to call risk and exposure accommodation.

The latter subjects of portfolio selection, portfolio allocation, and risk and exposure accommodation are arguably more important than the former subject of timing or predicting entry and exit points. Parameter sensitivity analysis tends to show that return and risk-return metrics are less sensitive to variability in the entry-exit approach as compared with portfolio selection, portfolio allocation, and risk accommodation approaches.

These aspects of financial modeling make it significantly more difficult, and they also seem to largely account for the repeated demise of many so-called "quantitative trading" operations.

Can you say anything about decision points and false alarm rates in travel delay prediction? It seems like there is a big penalty (risk) if somebody misses a flight based on a bad prediction, but a minor annoyance if the flight is delayed and you don't predict it correctly. Will you guys publish some ROC curves at some point?

There are always two key metrics in learning - in IR they are called precision and recall. We seem to be at somewhere around 85% precision and 60% recall. So discovering more delays is where we need to focus, and our false positive delay predictions are not the big issue for us right now.

We compute confusion matrices, and use them to derive precision, recall, and our false positive and false negative rates.

When you turn numerical probability distribution estimation into k-way classification, one side effect is that not all false positives are equal. If we way you will be delayed by over an hour, but you are delayed by 45 minutes, that not bad at all. But if we say you will be delayed by over an hour and you are on time, that is much worse. So there is a distance aspect to false positives.

We don't see a lot of bad false positives now, that is, we don't appear to tell people they will experience a long delay when in fact they will be on time. The bigger issue seems to be our recall - there are a lot of delays that we are just not able to detect yet. Alternatively stated, we have a problem with false negatives in the on time class.

We also have a strong temporal element to all this. How do these numbers change as we get closer or further away from your departure time? We're working on this analysis right now, and we do hope to publish a lot of useful metrics soon.

How was it working with YCombinator?

YCombinator is amazing. The people are all great; both YC founders and the people they invest in. The exposure we have gotten from being part of the program is itself worth the equity stake they take in the company. YC has put together a really fascinating business model. I would like to be an investor in YC.

Trevor Blackwell is a YC founder, friend, and maker of fine quality luxury robots. Actually, his big new thing is telepresence robots. Check out the bots at

Trevor is the one who lured me into working with the Flightcaster team. I hadn't met Paul Graham and his Wife Jessica before YC - they totally rock!

It has been an amazing experience and I would encourage anyone to apply; especially if you want to build something that solves a real problem.

Wikipedia Page Traffic Statistics Dataset

June 11, 2009

I've published a Wikipedia Page Traffic Data Set containing a 320 GB sample of the data used to power (I'll talk about Trending Topics more in a upcoming post). The EBS snapshot includes 7 months of hourly page traffic statistics for over 8 Million Wikipedia articles (~ 1 TB uncompressed) along with the associated Wikipedia content, linkgraph, & metadata. The english Wikipedia subset contains ~2.5 Million articles.

It only takes a couple of minutes to sign up for an Amazon EC2 account and set up access to the data as an EBS volume from the Amazon Management Console.

If you want to work entirely from the command line, you will need to complete the steps in the Getting Started Guide. When you are set up to use EC2, launch a small EC2 Ubuntu instance from your local machine:

    $ ec2-run-instances ami-5394733a -k gsg-keypair -z us-east-1a

Once it is running and you have the instance id, create and attach an EBS Volume using the public snapshot snap-753dfc1c (make sure the volume is created in the same availability zone as the ec2 instance)

    $ ec2-create-volume --snapshot snap-753dfc1c -z us-east-1a
    $ ec2-attach-volume vol-ec06ea85 -i i-df396cb6 -d /dev/sdf

Next, ssh into the instance and mount the volume

    $ ssh
    root@domU-12-xx-xx-xx-75-81:/mnt# mkdir /mnt/wikidata
    root@domU-12-xx-xx-xx-75-81:/mnt# mount /dev/sdf /mnt/wikidata

See the README files in each subdirectory for more details on these datasets...


The good stuff is sitting in 5000 files in /mnt/wikidata/wikistats/pagecounts/

    /mnt/wikidata/wikistats/pagecounts# ls -l | wc -l
    /mnt/wikidata/wikistats/pagecounts# ls -lh |head
    total 260G
    -rw-r--r-- 1 root root  49M 2009-02-26 13:34 pagecounts-20081001-000000.gz
    -rw-r--r-- 1 root root  46M 2009-02-26 13:34 pagecounts-20081001-010000.gz
    -rw-r--r-- 1 root root  47M 2009-02-26 13:34 pagecounts-20081001-020000.gz
    -rw-r--r-- 1 root root  44M 2009-02-26 13:34 pagecounts-20081001-030000.gz
    -rw-r--r-- 1 root root  45M 2009-02-26 13:34 pagecounts-20081001-040000.gz
    -rw-r--r-- 1 root root  47M 2009-02-26 13:35 pagecounts-20081001-050001.gz
    -rw-r--r-- 1 root root  45M 2009-02-26 13:35 pagecounts-20081001-060000.gz
    -rw-r--r-- 1 root root  50M 2009-02-26 13:35 pagecounts-20081001-070000.gz
    -rw-r--r-- 1 root root  51M 2009-02-26 13:35 pagecounts-20081001-080000.gz

This directory contains hourly Wikipedia article traffic logs covering the 7 month period from October 01 2008 to April 30 2009, this data is regularly logged from the wikipedia squid proxy by Domas Mituzas.

Each log file is named with the date and time of collection: pagecounts-20090430-230000.gz

Each line has 4 fields:

projectcode, pagename, pageviews, bytes

    en Barack_Obama 997 123091092
    en Barack_Obama%27s_first_100_days 8 850127
    en Barack_Obama,_Jr 1 144103
    en Barack_Obama,_Sr. 37 938821
    en Barack_Obama_%22HOPE%22_poster 4 81005
    en Barack_Obama_%22Hope%22_poster 5 102081

Wikilinks (1.1G)

Contains a wikipedia linkgraph dataset provided by Henry Haselgrove.

These files contain all links between proper english language Wikipedia pages, that is pages in "namespace 0". This includes disambiguation pages and redirect pages.

In links-simple-sorted.txt, there is one line for each page that has links from it. The format of the lines is ready for processing by Hadoop:

    from1: to11 to12 to13 ...
    from2: to21 to22 to23 ...

where from1 is an integer labelling a page that has links from it, and to11 to12 to13 ... are integers labelling all the pages that the page links to. To find the page title that corresponds to integer n, just look up the n-th line in the file titles-sorted.txt.

Wikidump (29G)

Contains the raw Wikipedia dumps from March along with some processed versions of the data. One of the files I created provides a direct lookup table for Wikipedia article redirects in page_lookup_redirects.txt, which can be applied to tasks like name standardization and search:

Here is a sample query run when the file is loaded into MySQL:

   mysql> select redirect_title, true_title from page_lookups
               where page_id = 534366;
   | redirect_title                                 | true_title   |
   | Barack_Obama                                   | Barack Obama |
   | Barak_Obama                                    | Barack Obama |
   | 44th_President_of_the_United_States            | Barack Obama |
   | Barach_Obama                                   | Barack Obama |
   | Senator_Barack_Obama                           | Barack Obama | 
                          .....                           .....         

   | Rocco_Bama                                     | Barack Obama |
   | Barack_Obama's                                 | Barack Obama | 
   | B._Obama                                       | Barack Obama |
   110 rows in set (11.15 sec)    

The raw wikipedia dump file latest-pages-articles.xml was also post-processed using xml2sql to produce a set of tab delimited text files for use with Hadoop and other tools :

692M page.txt
115M redirect.txt
987M revision.txt
17G text.txt

the corresponding namespace0 files were created by limiting page.txt and redirect.txt as follows:

# grep '^[0-9]*       0       ' page.txt > page_namespace0.txt
# grep '^[0-9]*        0       ' redirect.txt > redirect_namespace0.txt

Previous Posts