Analytics: Analyzing tracked events

07 Feb 2013

In the previous article, I showed two essential ways events get tracked.

Now I would like to focus on the latter part of tracking — storing all the event data and analyzing it ad hoc afterwards.

The main concern of this post is analyzing that data as fast as possible while keeping the analysis expressive.

We'll explore this using a shop tracking system as an example. It tracks only one kind of event — orders. To keep the samples simple, we'll calculate only these metrics:

  • average purchase total
  • top products

The structure of the event document is as follows (a sample document is shown below):

  • total — float order total
  • line_items — array of line items, each with:
    • sku — string product sku
    • price — float item price
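
For illustration, a single order event could look like this (the values are made up):

# A made-up order event with three line items.
{
  "total" => 54.97,
  "line_items" => [
    { "sku" => "BOOK-0042", "price" => 19.99 },
    { "sku" => "MUG-0007",  "price" => 14.99 },
    { "sku" => "PEN-0120",  "price" => 19.99 }
  ]
}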

For the sake of this example, we will run different analysis methods against a set of 1,000,000 such order events, each having 5 line items on average.

We'll use MongoDB, as schema-less documents are a good fit for storing events, and also because of its powerful analytics features. Note that those features will be described mainly in light of their application to the tasks above.

Sample sources

The naive way

The most obvious thing to do is to query all the events and perform our analysis in Ruby.

def average_purchase
  all.map(&:total).sum / all.count
end

def top_products
  all.map(&:line_items).flatten.            # every line item across all orders
    group_by(&:sku).                        # { sku => [line items] }
    map { |sku, items| [sku, items.count] }.
    sort_by(&:last).reverse                 # most purchased SKUs first
end
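
For context, the all method used above simply loads every event into the Ruby process. A minimal sketch of what could back it, assuming Mongoid models (the actual sample sources may differ):

require 'mongoid'

# Hypothetical models; field names follow the event structure described above.
class LineItem
  include Mongoid::Document
  embedded_in :order_event
  field :sku,   type: String
  field :price, type: Float
end

class OrderEvent
  include Mongoid::Document
  field :total, type: Float
  embeds_many :line_items
end

# Materializes every order event in memory — hence the cost shown below.
def all
  OrderEvent.all.to_a
end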

While this naive approach looks pretty simple, it is in no way performant.

naive           time
avg purchase    119,775ms
top products    canceled after 2+ hours of waiting
total           canceled after 2+ hours of waiting

Not only is it slow, it is also highly RAM-consuming. The overhead of transmitting 1M documents from MongoDB to the Ruby process and keeping them all in memory hurts.

Evaluating JS code on MongoDB server

What was the worst part of the previous approach? Right — transmitting the data and holding it all in memory.

If only there were a way to evaluate arbitrary code on the MongoDB side… Well, there is. It's called db.eval().

Using it, we can execute any JavaScript as if we were running it inside the mongo shell. It removes the overhead of transmitting the data, but introduces the overhead of running a JavaScript VM.

def average_purchase
  result = collection.database.command("$eval" => <<-EOJS)
    var orders = db.order_events.find(),
        order_count = orders.count(),
        total_sum = 0;

    orders.forEach(function(order) {
      total_sum += order.total;
    });

    return total_sum / order_count;
  EOJS

  result['retval']
end

def top_products
  result = collection.database.command("$eval" => <<-EOJS)
    var orders = db.order_events.find(),
        products = {};

    orders.forEach(function(order) {
      order.line_items.forEach(function(item) {
        var sku = item.sku;
        if (!products[sku]) products[sku] = 0;
        products[sku] += 1;
      });
    });

    return products;
  EOJS

  result['retval'].to_a.sort_by(&:last).reverse
end

It is more code, but the results are better:

db.eval()       time
avg purchase    8,228ms
top products    42,399ms
total           50,627ms

The disadvantages are the write lock held while it runs and the overhead of executing the code in a JavaScript VM.

Map/reduce

Map/reduce is a programming model for processing large data sets. One of its highlights is that processing can be distributed across a cluster. It also keeps the main benefit of db.eval() — the computation happens on the server. You can learn more about how MongoDB implements it here.

Note that map/reduce is not always a suitable substitute for db.eval(). The tasks we are trying to accomplish, on the other hand, are a perfect match for it.

def average_purchase
  # Emit pre-aggregated partials: reduce may be re-applied to its own output,
  # so the emitted shape has to match what reduce returns.
  map = "function() { emit('avg', { sum: this.total, count: 1 }); }"

  reduce = %Q{
    function(key, values) {
      var result = { sum: 0, count: 0 };
      values.forEach(function(value) {
        result.sum += value.sum;
        result.count += value.count;
      });
      return result;
    }
  }

  finalize = %Q{
    function(key, value) {
      value.avg = value.sum / value.count;
      return value;
    }
  }

  map_reduce(map, reduce).out(inline: 1).finalize(finalize).to_a.first['value']['avg']
end

def top_products
  # Emit a counter per line item; keys that get reduced more than once
  # (or never reduced at all) still end up with a correct purchases count.
  map = %Q{
    function() {
      this.line_items.forEach(function(item) {
        emit(item.sku, { purchases: 1 });
      });
    }
  }

  reduce = %Q{
    function(key, values) {
      var result = { purchases: 0 };
      values.forEach(function(value) {
        result.purchases += value.purchases;
      });
      return result;
    }
  }

  items = map_reduce(map, reduce).out(inline: 1).to_a
  items.map { |doc| [doc['_id'], doc['value']['purchases']] }.sort_by(&:last).reverse
end

Speed is generally worse than with db.eval(), because map/reduce splits the task into two phases. Run against a sharded cluster, though, the work would be dispatched to each shard, making it faster. (Map/reduce wasn't meant to be blazingly fast anyway.)

map/reduce      time
avg purchase    20,498ms
top products    112,035ms
total           132,534ms

It also suffers from the same problem as db.eval() — map/reduce holds the write lock while it runs.

Aggregation framework

While map/reduce is truly awesome, the amount of code to aggregate simple things is often overwhelming.

If you're familiar with the UNIX shell, you might also wonder why you can't easily chain different operations in map/reduce. What if you wanted to do something like:

match_criteria | sort | limit | group

Doing that with map/reduce would be a pain in the ass.

OK, that sucks, but what are the alternatives? Well, MongoDB 2.1 and higher ships with something called the "aggregation framework". Essentially, it is chainable map/reduce with common use cases implemented and optimized upfront. This is not a tutorial on it, so refer to the docs when you need them. I'll just show how neat it is.

def average_purchase
  collection.aggregate([
    {"$group" => {"_id" => "avg", "avg" => {"$avg" => "$total"}}}
  ]).first['avg']
end

def top_products
  collection.aggregate([
    {"$unwind" => "$line_items"},                                       # one document per line item
    {"$project" => {"item" => "$line_items"}},                          # keep just the line item
    {"$group" => {"_id" => "$item.sku", "purchases" => {"$sum" => 1}}}, # count purchases per sku
    {"$sort" => {"purchases" => -1}}                                    # most purchased first
  ]).map { |doc| [doc['_id'], doc['purchases']] }
end

Running it, we see a great speed improvement over any of the previously discussed approaches:

aggregation     time
avg purchase    2,511ms
top products    29,146ms
total           31,658ms
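
The pipeline is also easy to extend. The shell-like chain sketched earlier (match | sort | limit | group) maps directly onto pipeline stages; here is a hypothetical example (the criteria, limit and grouping are made up and not one of the measured metrics):

# Average total of the 1,000 most expensive orders over 100 — purely illustrative.
collection.aggregate([
  {"$match" => {"total" => {"$gte" => 100}}},                  # match_criteria
  {"$sort"  => {"total" => -1}},                               # sort
  {"$limit" => 1000},                                          # limit
  {"$group" => {"_id" => nil, "avg" => {"$avg" => "$total"}}}  # group
])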

Possible optimization techniques

Whatever method you end up using, it is wise to apply some optimization to metric calculation.

Computing your metrics on every access would be wasteful. Instead, once calculated, their values should be cached and served from the cache until the metrics' dependencies change.

In our examples, the caches should be invalidated only when a new order is tracked.
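
A minimal caching sketch, assuming a Rails-style cache store (the cache keys and the track_order hook are made up for illustration):

# Hypothetical caching wrapper; any key/value cache works the same way.
def cached_top_products
  Rails.cache.fetch("metrics/top_products") { top_products }
end

# Invalidate the cached metrics whenever a new order event is tracked.
def track_order(attributes)
  collection.insert(attributes)   # however order events are actually persisted
  Rails.cache.delete("metrics/top_products")
  Rails.cache.delete("metrics/average_purchase")
end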

Comparison

                          naive                      db.eval()   map/reduce   aggregation
avg purchase              119,775ms                  8,228ms     20,498ms     2,511ms
top products              canceled after 2+ hours    42,399ms    112,035ms    29,146ms
total                     canceled after 2+ hours    50,627ms    132,534ms    31,658ms
speed                     exceptionally low          high        medium       super high
memory usage              high                       low         low          low
data transfer overhead    yes                        no          no           no
JS VM overhead            no                         yes         yes          yes
write lock                no                         yes         yes          yes
easily distributed        —                          no          yes          yes
optimized                 —                          —           no           yes
output limit              —                          —           16Mb         16Mb
chainable                 —                          —           no           yes

Conclusion

The MongoDB aggregation framework proves to be the fastest option you have. It is powerful enough to perform simple as well as more advanced analysis, it scales well, and it is simple to write.

If you need something more sophisticated than the aggregation framework can express, you can use map/reduce, trading off performance and losing easy chainability. It still scales well, though.

If your task is really simple but impossible to express with the aggregation framework, or is very slow under map/reduce, it might be worth giving db.eval() a shot. Cons? It doesn't scale.

Querying all the documents and performing the analysis in your programming language is the last resort. It loses by every measure and should generally be used only when none of the other options gives the desired result.

Sample sources

I am not an expert at analytics, just a curious one. If I got something wrong and/or recommended a bad practice, feel free to let me know.

Discuss this post on HN.
