09 February 2009

Last month, our team at Juice launched a Django web analytics app called Concentrate that ingests search queries from sources like Google Analytics or Hitwise, then enhances this raw data by discovering common query patterns, generating segmented reports, and offering visual interfaces for data exploration. Jeff Barr wrote about the technology stack we used to build the app itself a couple of weeks ago at the AWS blog. I'll provide some more detail on that topic later this week. This post will give a basic description of Concentrate's pattern discovery algorithm and show it in action.

The following mashup provides a visual interface for exploring search patterns used by readers of the Data Wrangling blog by combining output from concentrateme.com with the Google AJAX search API. Each bubble in the visualization below represents a search query typed into Google during the last 2 months that led to clicks on this site (~2000 unique queries, ~3400 searches). The size of each bubble represents the number of visitors referred by that particular query, and the bubbles are colored by query cluster, based on phrase pattern structure ('python [x]', '[x] video', etc). The search results below the chart are highlighted in yellow if they lead to datawrangling.org pages, which lets you see at a glance where the site ranks for each query.

Search map of queries leading to clicks on datawrangling.org

Click to open the query browser in a new window, then mouse over a query bubble and click to update the search results.

Interactive Search Query Map

Search query referrals from Google depend on a number of factors, including what content you have, how well it ranks in Google, and how often people actually search for it. I'm not the most prolific blogger, so the topical coverage seen in the chart is somewhat sparse. Alternate views can offer additional insight, for example sizing the bubbles by an engagement metric like average time on site instead of visit count. You can download the raw data here if you want to experiment further: datawrangling_dec_jan.csv

Pattern discovery in Concentrate grew out of discussions I had with clients about current pain points in web analytics. A common theme seemed to be that they had large amounts of internal and competitive search referral data, but found managing and deriving insights from the data to be difficult using existing tools. My first instinct was to give them a summary view and trend reports by clustering queries based on topic using methods like NMF (I'll be doing a few posts on topic classification later this month), but it turned out that the use cases they described were better served by another approach which automatically discovered common text patterns in the data and segmented queries based on phrase structure.

What I'm calling "patterns" are really regular expression templates for searches that share a similar structure. For instance, the pattern “jobs in [x]” represents searches for jobs in some location. The “[x]” is a wildcard that can stand for one or more words. The words matched by a wildcard are often variants of a similar concept, such as locations, brands, or celebrity names. As it turned out, a number of web analytics teams I talked to were spending a lot of time mining this kind of information manually from their search data in an effort to get a picture of their search traffic.
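To make the template idea concrete, here is a minimal sketch of how a pattern like "jobs in [x]" could be turned into a Python regular expression. This is only an illustration, not Concentrate's implementation; the function name pattern_to_regex and the choice to treat "[x]" as one or more whitespace-separated words are assumptions made for the example.

```python
import re

def pattern_to_regex(pattern, wildcard="[x]"):
    """Compile a template such as 'jobs in [x]' into a regular expression.

    Hypothetical helper for illustration; not part of Concentrate.
    """
    # Escape the literal text around each wildcard.
    literal_parts = [re.escape(part) for part in pattern.split(wildcard)]
    # Assumption: a wildcard stands for one or more whitespace-separated words.
    words = r"(\S+(?:\s+\S+)*)"
    return re.compile("^" + words.join(literal_parts) + "$", re.IGNORECASE)

jobs = pattern_to_regex("jobs in [x]")
match = jobs.match("jobs in new york")
print(match.group(1))  # new york
```

Each wildcard becomes a capture group, so the matched words (here the location) can be pulled out and counted per pattern.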

Clustering searches based on phrase similarity is a problem that people have looked at before in fields like question answering. Several interesting papers are referenced in my delicious clustering tag if you want to dig deeper. I think the novel part in Concentrate is that it combines a custom distance metric with an iterative algorithm that alternates between clustering queries and extracting common patterns within those clusters in a scalable fashion. Since the algorithm is part of a commercial service I can't go into much more detail about how this all works, but hopefully you get the general idea of how this kind of pattern discovery can be useful.

The original idea for the interactive query scatter plot was inspired by a Netflix Prize data visualization at A Beautiful WWW. A similar analysis was run by Mark Reid using book borrowing data at Inductio ex Machina.

To generate the scatter plot, I tied my hands a bit and only allowed myself to use the CSV files downloaded from Concentrate as input to the mashup (an API may be on the horizon). I made similar plots to aid in development and debugging of the clustering and pattern extraction algorithms, but this version is nice because you don't need any extra information to generate it yourself. I computed the inter-query distance for all pairs of query strings in the file using a string edit distance metric based on a combination of the queries and the phrase pattern labels discovered by Concentrate (also in the CSV).
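A pairwise distance computation of this kind might look roughly like the sketch below. The post does not say exactly how the query and pattern distances were combined, so the weighted sum in combined_distance, the pattern_weight parameter, and the (query, pattern) row layout are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def combined_distance(row_a, row_b, pattern_weight=1.0):
    """Assumed combination: query edit distance plus weighted pattern edit distance."""
    (query_a, pattern_a), (query_b, pattern_b) = row_a, row_b
    return (levenshtein(query_a, query_b)
            + pattern_weight * levenshtein(pattern_a, pattern_b))

def distance_matrix(rows):
    """Full symmetric matrix of distances between all pairs of rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = combined_distance(rows[i], rows[j])
    return dist

# Made-up (query, pattern) rows standing in for lines of the CSV export.
rows = [("python csv", "python [x]"),
        ("python numpy", "python [x]"),
        ("hadoop video", "[x] video")]
dist = distance_matrix(rows)
```

With around 2000 unique queries this is about two million pairs, which is slow but workable in pure Python for a one-off script.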

The scatter plot layout was generated by running multidimensional scaling (MDS) on the resulting full distance matrix. The MDS approach consisted of applying the SVD function built into NumPy to the distance matrix to find the first two basis vectors, which were then used as X-Y coordinates. To produce the actual scatter plot HTML image map, I used Matplotlib and a smattering of Python code from the following links:
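The SVD step can be sketched in a few lines of NumPy. This is a reconstruction under assumptions rather than the original script: the function name svd_layout is made up, the scaling by the square root of the singular values is a choice made here, and the optional center flag adds the double-centering used in textbook classical MDS, which the post does not mention.

```python
import numpy as np

def svd_layout(dist, center=False):
    """Return an (n, 2) array of X-Y coordinates from an n x n distance matrix.

    center=False applies the SVD to the distance matrix directly, as described
    in the post. center=True double-centers the squared distances first, as in
    textbook classical MDS (an assumption, not stated in the post).
    """
    d = np.asarray(dist, dtype=float)
    if center:
        n = d.shape[0]
        j = np.eye(n) - np.ones((n, n)) / n   # centering matrix
        d = -0.5 * j @ (d ** 2) @ j
    u, s, _ = np.linalg.svd(d)
    # Keep the first two left singular vectors, scaled by sqrt of their singular values.
    return u[:, :2] * np.sqrt(s[:2])

# Tiny made-up example: four points at the corners of a 3 x 4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = svd_layout(dist, center=True)
```

The resulting columns can be passed straight to Matplotlib's scatter function as the x and y positions, with bubble size and color taken from the visit counts and pattern labels.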

This was just intended as a quick one-off example, but if anyone out there wants to generate the same visualization using another site's search traffic, feel free to contact me. If you have Google Analytics on your site, you can sign up for a free Concentrate account and email me a CSV report containing your pattern clusters. If you want to find out more about how you can apply search patterns for web analytics applications, check out these posts from the Juice blog: