21 November 2008

Amazon announced their Hosted Public Data Sets service today, and I expect it to be a game changer. Finding and using datasets on the web just got a lot easier. Similar to how developers can share Amazon Machine Images on EC2, you can now freely share large datasets in the cloud using Amazon EBS snapshots.

A few months ago, Jeff Barr stopped by Juice to talk with our team about how we are using Amazon EC2 and SQS to scale our data mining efforts. One of the issues I brought up was the potential cost and hassle of shuffling large datasets on and off AWS. Jeff discussed his concept of using Amazon as a kind of data and application ecosystem, where companies, researchers, and data providers interact on AWS and take advantage of the transfer efficiencies of staying within the Amazon infrastructure and using data and APIs locally.

This seems to be a part of that vision, and I'm looking forward to unleashing Hadoop on whatever data flows into the system.

From the AWS Public Data site:

Select public data sets are hosted on Amazon EC2 for free as Amazon Elastic Block Store (Amazon EBS) snapshots. Amazon EC2 customers can access this data by creating their own personal Amazon EBS volumes, using the public data set snapshots as a starting point. They can then access, modify and perform computation on these volumes directly using their Amazon EC2 instances and just pay for the compute and storage resources that they use. If available, researchers can also use pre-configured Amazon Machine Images (AMIs) with tools like Inquiry by BioTeam to perform their analysis.

To get started using the Public Data Sets on AWS, simply perform these three easy steps:
  1. Sign up for an Amazon EC2 account.
  2. Launch an Amazon EC2 instance.
  3. Create an Amazon EBS volume using the Snapshot ID listed in the catalog above for your chosen snapshot.
...If you have a public domain or non-proprietary data set that you think is useful and interesting to the AWS community, please submit a request below and the AWS team will review your submission and get back to you. Typically the data sets in the repository are between 1 GB to 1 TB in size (based on the Amazon EBS volume limit), but we can work with you to host larger data sets as well. You must have the right to make the data freely available.
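
For the curious, here is a rough sketch of what step 3 in the quoted list might look like in Python. It assumes the boto3 library, which is not mentioned by Amazon in the text above and is only one of several ways to call the EC2 API. The function names, the device name, and every ID in the comments are hypothetical placeholders, not values from the AWS catalog. The `ec2` argument is expected to behave like a boto3 EC2 client.

```python
def volume_request(snapshot_id, availability_zone):
    """Build the arguments for creating an EBS volume from a snapshot.

    The volume has to live in the same availability zone as the
    instance that will attach it.
    """
    return {"SnapshotId": snapshot_id, "AvailabilityZone": availability_zone}


def create_and_attach(ec2, snapshot_id, instance_id, availability_zone,
                      device="/dev/sdf"):
    """Create a volume from a public data set snapshot and attach it.

    `ec2` is assumed to be a boto3 EC2 client (or anything with the same
    create_volume / get_waiter / attach_volume methods). The default
    device name is an illustrative choice, not a requirement.
    """
    volume = ec2.create_volume(**volume_request(snapshot_id, availability_zone))
    volume_id = volume["VolumeId"]

    # Wait until the new volume is ready before trying to attach it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id


# Hypothetical usage; all IDs below are placeholders:
#
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   create_and_attach(ec2, "snap-EXAMPLE", "i-EXAMPLE", "us-east-1a")
```

Once the volume is attached, it still has to be mounted from inside the instance before the data is readable; that part depends on the operating system and is left out here.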