This is probably the most readable article I've seen on the topic. If the author is reading this: your article is good and you should feel good.
One more fun thing about Bloom filters: generalized, they're not just a way of testing for set membership. They're a way of getting an upper bound on some function of a finite input set. For example, suppose I have a set of a million names and, for each name, how tall that person is, and I want a data structure that will give me an upper bound on a person's height given their name. I know, this is totally contrived, but here's what you could do:
1. Make an array of height values, initially zero. Call it heights.
2. For each name, get several hash values: h1, h2, ... hn.
3. For each hash h, set heights[h] = max(heights[h], height). In other words, you're only pushing the height values upward.
4. To get an upper bound on a person's height, hash their name and look at your heights array. The upper bound is min(heights[h1], heights[h2], ... heights[hn]).
(If you want to be very technical, this works not only for numbers, but for any set that supports max() and min() operations. A Bloom filter for set membership uses such a set with two elements, 0 and 1.)
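The steps above can be sketched in a few lines of Python. This is a hypothetical toy implementation, not a standard library type; the class name, the use of SHA-256 slices as the hash family, and the table size are all my own choices for illustration:

```python
import hashlib

class MaxFilter:
    """Bloom-filter-style sketch giving an upper bound on a value
    associated with a key. Toy illustration, not a standard structure."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.heights = [0] * size  # step 1: all zeros initially

    def _hashes(self, key):
        # Step 2: derive several hash values from one SHA-256 digest
        # (an assumption; any independent-ish hash family would do).
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, name, height):
        # Step 3: only ever push the stored values upward.
        for h in self._hashes(name):
            self.heights[h] = max(self.heights[h], height)

    def upper_bound(self, name):
        # Step 4: the min over the key's slots is a valid upper bound.
        return min(self.heights[h] for h in self._hashes(name))

mf = MaxFilter()
mf.add("alice", 170)
mf.add("bob", 185)
print(mf.upper_bound("alice"))  # always >= 170, may overestimate on collisions
```

Like a Bloom filter, it can err in only one direction: collisions can inflate the bound, but `upper_bound` can never come back below a height that was actually added for that name.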
Hey eeeverrrybody. So my brain melted and I only understood about half of this. Where do I go to learn more to better understand this? I love this stuff, but my melon can only retain/understand so much without going insane.
Try implementing some of the data structures and throw some toy problems at them. That's a really good way to get a feel for them. I've done that with AI algorithms, and I found that it really helps solidify the concepts because sometimes the results don't make sense and I have to figure out "why". Implementing the algorithms forces you to do a deep dive, but at a pace that hopefully won't fry your brain. Don't overlook the papers that the author linked to at the bottom of the article.
What a great read. I am always curious what might be a good user experience that lets a user select "time" vs. "precision/accuracy" in an interface. A slider comes to mind, but has anyone seen a clever implementation that lets users choose how they want an answer delivered?
There's a new research collaboration between UC Berkeley and MIT on probabilistic query answering. It lets users specify the trade-off between error/confidence bounds and query time.
These techniques are quite sensitive to the distribution of the data under study. For instance, if the numbers being processed by the LogLog counter aren't uniformly distributed, the cardinality estimates you get from it could be arbitrarily biased.
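The standard mitigation is to run the raw values through a good hash function first, so the counter sees (approximately) uniform bits regardless of the input distribution. Here's a hedged toy sketch of a LogLog-style estimator along those lines; the register count, the SHA-256 choice, and the constant are my own simplifications (real HyperLogLog adds bias and range corrections):

```python
import hashlib

def toy_loglog(values, num_registers=64):
    """Toy LogLog-style cardinality estimate. Hashing the raw values
    first is what makes the bit patterns uniform; illustration only."""
    registers = [0] * num_registers
    for v in values:
        # Hash the raw value so the counter sees uniform-looking bits.
        h = int.from_bytes(
            hashlib.sha256(str(v).encode()).digest()[:8], "big")
        idx = h % num_registers   # low bits pick a register
        rest = h // num_registers  # remaining bits feed the rank
        if rest == 0:
            rank = 59  # all remaining bits zero (vanishingly unlikely)
        else:
            # Rank = 1-indexed position of the lowest set bit.
            rank = (rest & -rest).bit_length()
        registers[idx] = max(registers[idx], rank)
    # Classic LogLog estimate: alpha * m * 2^(mean register value).
    mean = sum(registers) / num_registers
    return 0.39701 * num_registers * 2 ** mean
```

Because each register only keeps a max over hashed values, feeding in duplicates doesn't change the estimate at all, which is exactly the property that makes these sketches useful for counting distinct items.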