05 October 2009

High level languages for MapReduce

Big data hackers, Apache Hadoop developers, and early adopters from several industries descended on the Roosevelt Hotel this weekend for Hadoop World NYC. I gave a talk on rapid prototyping of data intensive web applications using Hadoop, Hive, Python, and Ruby on Rails. The talk also had a few bits about using R with Hadoop for statistical computing at scale. The sessions were taped, and you can check out the video of my talk here: "Building Data Intensive Apps with Hadoop and EC2".

The slides give a high level overview of how I built the open source trend tracking site trendingtopics.org over a few weeks last June using Amazon EC2 and Cloudera tools. The code for the site is on Github and the raw data it is powered by is available on Amazon Public Data Sets. I've also posted a series of tutorials related to trendingtopics on the Cloudera blog over the past few months:

Here are a few resources mentioned in the talk:

Conference Highlights

It felt like around half of the attendees of Hadoop World were developers or data hackers I know of via Twitter or the Hadoop mailing lists. This resulted in some decent Twitter coverage via the hadoopworld hash tag. The other half of attendees represented enterprise IT, media companies, government, and financial firms who are either early adopters or interested in using Hadoop.

Some interesting announcements were made in the morning. Amazon added new features for the Elastic MapReduce service, including support for Hive, Cloudera's Hadoop Distribution, and integration with Karmasphere Studio.

Cloudera's big news was the launch of Cloudera Desktop, a new web-based unified user interface for users and operators of Hadoop clusters. Note that you can also run the Cloudera Desktop on Amazon EC2. Cloudera announced support for their distribution on Softlayer and Rackspace. They also outlined new features in the latest Hadoop distribution (CDH2), which includes support for HBase and Hadoop 0.20.1.

Vertica announced a partnership with Cloudera, which is an interesting development considering the RDBMS vs. MapReduce debates that took place last year.

I think I actually spent more time talking data with fellow hackers like Joshua Reich and Hillary Mason than I did in the talks, but still managed to catch some good ones by the EHarmony team, Stuart Sierra, Deepak Singh, and several Yahoo people. As a big Python user, it was exciting to hear that Jake Hofman from Yahoo! Research, NY plans on an open source release of a Python based Social Network Library for Hadoop, which he used to generate the Twitter analysis in his talk. A big theme in my talk and others I attended was the use of high level languages on top of Hadoop to accelerate development. Most of the teams I talked to actively use multiple abstractions on top of Hadoop, including Pig, Hive, Clojure, or other languages like Python through Hadoop Streaming.

For further details check out these notes from other attendees: