During the last several years, an increasing number of systems within government and industry have been collecting massive amounts of raw data which often sits untapped in large data warehouses. FlightCaster strikes me as a great example of the next generation of web applications that will leverage that data: bootstrapped startups that apply machine learning and data processing at scale to solve a focused problem people actually care about.
From the site:
"FlightCaster predicts flight delays. We use an advanced algorithm that scours data on every domestic flight for the past 10-years and matches it to real-time conditions. We help you evaluate alternative options and help connect you to the right person to make the change."
FlightCaster uses data from:
- Bureau of Transportation Statistics
- FAA Air Traffic Control System Command Center
- National Weather Service
I've been following the data crunching exploits of Bradford Cross on Twitter, and the launch of FlightCaster seemed like a great opportunity for an "in the trenches" interview on building a machine learning application with Rails & Hadoop. During the interview on FlightCaster, Brad describes some of the challenges of working with flight data, statistical approaches for flight prediction, false negatives in FlightCaster, Clojure, Hadoop & Amazon EC2, YCombinator, and lots more.
Of the 9? people at Flightcaster, how did the roles break down? How did you get started on the problem?
People at YC make fun of us a lot on account of our monolithic team.
We have a core team of 5 founders which breaks down as follows; a CEO, an air travel domain expert, two engineers working on web service, apps, and production issues, and me for research.
I have a secret agent collaborator working with me who has been very helpful with research and scalable compute infrastructure.
We also have a few other engineers doing a mix of stuff. This is a speed thing. We wanted to launch an iPhone app, blackberry app, and website all at the same time. We built it all very fast.
We got started on the problem thanks to Evan Konwiser and his peculiar and endearing obsession with the commercial air travel industry.
Did you have someone on the team with domain expertise who was familiar with the flight or weather data sources?
Absolutely, that is Evan Konwiser. He has been instrumental, and not just for familiarity with the data.
The airline industry is a very idiosyncratic industry. It would be hard to learn a lot of the subtle domain logic via induction alone. Our machine learning approach uses a mix of analytical and inductive learning.
What was the biggest challenge in working with the public flight and weather data? Is there a dataset that isn't out there right now that might make predicting flight delays easier for you?
The public data set that we use is the "on-time database" published by the FAA. The data set is tricky to get all in one place since the FAA does not provide any decent API to it. The biggest issue is that we make real time predictions, so we needed a historical set of captured real time data, which we had to create ourselves.
Having a more amalgamated real time dataset going back historically for a decade would be a big help. Having more modernized ways of accessing the data would be helpful.
Until then, if anyone wants to buy it, we will sell it to them for a very high price ... and to sweeten, they must throw in a few obscure, expensive machine learning books that I have on my Amazon wish list.
I've been a big fan of using the combination of Rails, Hadoop, and Amazon EC2 along with a high level language (in your case Clojure). Any tips for people out there thinking of using a similar technology stack? How cost effective is running Hadoop on EC2 for you?
Building layer upon layer of abstraction is a big key. On the jvm, you have to do this, it is the path around the verbosity of Java and the vast abyss of poorly done APIs. You just keep searching until you finally find the folks who have built a sane, high level API on top of the thing you want to use - then you wrap it in a high level language like Clojure. The technical term for this is "wrap the crap."
In our case, we use Cascading as our step up in abstraction on top of Hadoop.
S3 -> EC2 -> Cloudera -> HDFS -> Hadoop -> Cascading -> Clojure. I'm not sure if those layers are exactly the right order, but you get the point. The key is go keep layering until you encapsulate the plumbing and get to the level of abstraction that lets you focus on solving your problem.
Running Hadoop on EC2 has been very cost effective. The biggest issues have come into play with the disconnect between Hadoop and S3. S3 expects open connections to keep reading, and if they don't, S3 terminates them. S3 is very much the Arnold of the distributed file system world. So if your Hadoop jobs are compute intensive, and they are buffering in data in a lazy loading fashion, they tend to lose the connection to S3 during long processing phases. We've worked around this with some hackery, and we are working with Chris Wensel (of Cascading fame) on a more industrial strength solution to the problem.
There are many existing machine learning and statistical computing packages for R, Python, Java, C++, why did you choose to go with Clojure? What were the pros/cons of that approach? Do you use any of those other tools for prototyping or visualization?
Over the years, I have found that, in practice, the statistical and machine learning code is not the big thing to worry about. That code is fun to write, and often you want to tweak and extend the libraries in ways that they have not been designed for. So we use libraries and frameworks for the basics where we can, but we are OK to implement statistical and machine learning algorithms ourselves. I'm quite experienced with this anyway; all the way down to efficient custom data structures built on arrays.
The bigger problem to worry about is in the title of your site; the data wrangling. Especially pre-processing (filtering, transforming, etc) and the general fault-tolerant distributed compute infrastructure. I worked on this sort of thing during my time at Google, and it is far more complex than it seems. It is easy to grasp the concepts, and get an initial implementation, but the edge cases and last mile issues with respect to the fault tolerance are where you take the hit.
Hadoop is a wonderful solution for distributed computation, and since our code is all purely functional, it is very natural for us.
Clojure is ideal for both the data wrangling, as well as the statistical and machine learning code. Also, Clojure plays nice with everything on the jvm so we wrap and use lots of libs from the java world. Put this together with the distributed compute infrastructure that you get with Hadoop, and it starts to make a lot more sense to build these systems in Clojure on the jvm and use Hadoop than it does to use R, python, etc.
That said, if you just need to do quick and dirty prototypes, or don't have the need or the option to invest in infrastructure for distributed computing, R and SciPy are probably still the place to turn.
At Google, the research scientists prototype in python and R, and then port to C++ for the real scalable map reduce runs. We prototype and run in production on the same language and platform, and although it is not as fast as Google's C++ infrastructure, we do have the benefit that Clojure is very high level. For large runs, we parallelize with Hadoop, and we just run smaller tasks locally as if our Hadoop infrastructure were not even there. We are not coupled to it, and yet we don't need to port code or do anything special to run it in distributed mode.
Can you talk more about how you are using Hadoop for Flightcaster? Are most of the jobs I/O intensive, preprocessing a large volume of historical input data, or are they more often cross-validation or simulation runs? Do you tweak your live models then re-train against all the historical data with Hadoop?
Our jobs are CPU intensive - we do a lot of computation per unit of data, even in our data transformation jobs.
We train and test all our models offline, on captured real time data.
Eventually we would like to move to a more incremental online learning approach, but for now we re-train and re-test against newly captured data and re-deploy new the newly trained model.
Any tips for handling bad data/records in large MapReduce jobs?
We've been talking to Chris Wensel and cascading has some really cool stuff for this in the form of "filters" that we are not leveraging yet. We do a lot of filtering and scrubbing of our own during the preprocessing phase, but we will be looking at what cascading can do for us here very soon.
The biggest lesson we have learned about bad data is that we should spend more time up front with visualizing the data and making assertions (or filtering via predicates) to verify that our assumptions about the data hold.
I have had these issues in the past with financial data, but not as bad. Can a flight land before it takes off? Are there 68 hours in a day? These are examples of data that is not malformed or missing, but that violates fundamental properties that you expect to hold true, so they can be trickier to catch.
The big problem with not spending more time on this up front is that these issues are expensive to catch downstream because you will be looking through non-intuitive results of complex analysis jobs and it takes a long time to track down the root of such issues. It is similar to the argument for having good unit tests; it is cheaper to find issues at that level than at the system test level. It is also more sanity-preserving.
You probably can't say much about your secret sauce or algorithms, but can you discuss some general issues you encounter handling conflicting information sources or incomplete data? Do ensemble approaches or online learning play a role?
As I mentioned in a previous response, we want to get into more of an incremental online learning approach, and I think we will be able to soon. Of course this means certain approaches are in and others are out, but that may be OK in our case.
As of right now, we are using a combination of analytical and inductive learning. We compose individual classifiers that are themselves composed of rich domain logic. This is not an ensemble approach, though we may head in that direction very soon.
What we are doing now is more of a network-of-classifiers approach. It is a bit of a strange beast right now, so we would be remiss to call it a Bayesian network. Maybe we could call it a Cardono network in honor of Jerolomo Cardano. It is a bit of an eccentric network that is ahead of its time but waiting for some formalism to come along and straighten it out.
To a large extent, our current approach is an artifact of how quickly we have built our initial model. We have these rich individual domain specific classifiers that are targeted at different features from different data sources, and the approach we are using to learn the composition of these individual classifiers is an area of active research.
What is your general style for attacking prediction problems? Do you start with a single data source and get the entire pipeline running with a simple model, or you dive in building a more complex model with multiple data sources? How did Flightcaster compare to financial prediction?
I like to let simpler cases drive out a lot of the demand for infrastructure early on. As you suggest, I tend to start by getting something working end-to-end with a simple model and a single data source. Then you can start to evolve faster, join in other data sources, and try more deeply theoretical ideas; all on top of a quality infrastructure.
That said, some infrastructure is not required until you get to more complex models, so it is always an evolutionary process where model drives infrastructure.
I made a lot of mistakes early in my career in building trading models where I let me theories get too far ahead of what I could really test in practice. That is not a good place to be. Unfortunately, this is an easy mistake to make.
Flightcaster has been very different from my work in investing in a number of ways. FlightCaster makes predictions by turning the probability distribution estimation problem into a k-way classification problem.
This is a simpler problem than designing holistic investment strategies.
In investing, you have to answer many questions. Selecting the point at which buying and selling occurs is akin to predictive problems in other domains. However, this only answers the question of when to buy or sell.
You also have to decide what to buy or sell - which is called portfolio selection. You also have to decide how much to buy or sell - which is called portfolio allocation. There are also the topics that I like to call risk and exposure accommodation.
The latter subjects of portfolio selection, portfolio allocation, and risk and exposure accommodation are arguably more important than the former subject of timing or predicting entry and exit points. Parameter sensitivity analysis tends to show that return and risk-return metrics are less sensitive to variability in the entry-exit approach as compared with portfolio selection, portfolio allocation, and risk accommodation approaches.
These aspects of financial modeling make it significantly more difficult, and they also seem to largely account for the repeated demise of many so-called "quantitative trading" operations.
Can you say anything about decision points and false alarm rates in travel delay prediction? It seems like there is a big penalty (risk) if somebody misses a flight based on a bad prediction, but a minor annoyance if the flight is delayed and you don't predict it correctly. Will you guys publish some ROC curves at some point?
There are always two key metrics in learning - in IR they are called precision and recall. We seem to be at somewhere around 85% precision and 60% recall. So discovering more delays is where we need to focus, and our false positive delay predictions are not the big issue for us right now.
We compute confusion matrices, and use them to derive precision, recall, and our false positive and false negative rates.
When you turn numerical probability distribution estimation into k-way classification, one side effect is that not all false positives are equal. If we way you will be delayed by over an hour, but you are delayed by 45 minutes, that not bad at all. But if we say you will be delayed by over an hour and you are on time, that is much worse. So there is a distance aspect to false positives.
We don't see a lot of bad false positives now, that is, we don't appear to tell people they will experience a long delay when in fact they will be on time. The bigger issue seems to be our recall - there are a lot of delays that we are just not able to detect yet. Alternatively stated, we have a problem with false negatives in the on time class.
We also have a strong temporal element to all this. How do these numbers change as we get closer or further away from your departure time? We're working on this analysis right now, and we do hope to publish a lot of useful metrics soon.
How was it working with YCombinator?
YCombinator is amazing. The people are all great; both YC founders and the people they invest in. The exposure we have gotten from being part of the program is itself worth the equity stake they take in the company. YC has put together a really fascinating business model. I would like to be an investor in YC.
Trevor Blackwell is a YC founder, friend, and maker of fine quality luxury robots. Actually, his big new thing is telepresence robots. Check out the bots at anybots.com.
Trevor is the one who lured me into working with the Flightcaster team. I hadn't met Paul Graham and his Wife Jessica before YC - they totally rock!
It has been an amazing experience and I would encourage anyone to apply; especially if you want to build something that solves a real problem.