This is probably the most readable article I've seen on the topic. If the author is reading this: your article is good and you should feel good.
One more fun thing about Bloom filters: generalized, they're not just a way of testing for set membership. They're a way of getting an upper bound on some function of a finite input set. For example, suppose I have a set of a million names and, for each name, how tall that person is, and I want a data structure that will give me an upper bound on a person's height given their name. I know, this is totally contrived, but here's what you could do:
1. Make an array of height values, initially zero. Call it heights.
2. For each name, get several hash values: h1, h2, ... hn.
3. For each hash h, set heights[h] = max(heights[h], height). In other words, you're only pushing the height values upward.
4. To get an upper bound on a person's height, hash their name and look at your heights array. The upper bound is min(heights[h1], heights[h2], ... heights[hn]).
(If you want to be very technical, this works not only for numbers, but for any set that supports max() and min() operations. A Bloom filter for set membership uses such a set with two elements, 0 and 1.)
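The steps above can be sketched in a few lines of Python. This is a hypothetical toy implementation, not a standard library type; the class name, the use of SHA-256 slices as the hash family, and the table size are all my own choices for illustration:

```python
import hashlib

class MaxFilter:
    """Bloom-filter-style sketch giving an upper bound on a value
    associated with a key. Toy illustration, not a standard structure."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.heights = [0] * size  # step 1: all zeros initially

    def _hashes(self, key):
        # Step 2: derive several hash values from one SHA-256 digest
        # (an assumption; any independent-ish hash family would do).
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, name, height):
        # Step 3: only ever push the stored values upward.
        for h in self._hashes(name):
            self.heights[h] = max(self.heights[h], height)

    def upper_bound(self, name):
        # Step 4: the min over the key's slots is a valid upper bound.
        return min(self.heights[h] for h in self._hashes(name))

mf = MaxFilter()
mf.add("alice", 170)
mf.add("bob", 185)
print(mf.upper_bound("alice"))  # always >= 170, may overestimate on collisions
```

Like a Bloom filter, it can err in only one direction: collisions can inflate the bound, but `upper_bound` can never come back below a height that was actually added for that name.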
Hey eeeverrrybody. So my brain melted and I only understood about half of this. Where do I go to learn more to better understand this? I love this stuff, but my melon can only retain/understand so much without going insane.
Try implementing some of the data structures and throw some toy problems at them. That's a really good way to get a feel for them. I've done that with AI algorithms, and I found that it really helps solidify the concepts because sometimes the results don't make sense and I have to figure out "why". Implementing the algorithms forces you to do a deep dive, but at a pace that hopefully won't fry your brain. Don't overlook the papers that the author linked to at the bottom of the article.
What a great read. I am always curious what might be a good user experience that lets a user select "time" vs. "precision/accuracy" in an interface. A slider comes to mind, but has anyone seen a clever implementation that lets users choose how they want an answer delivered?
There's a new research collaboration between UC Berkeley and MIT on probabilistic query answering. It lets users specify the trade-off between error/confidence bounds and query time.
These techniques are quite sensitive to the distribution of the data under study. For instance, if the numbers being processed by the LogLog counter aren't uniformly distributed, the cardinality estimates you get from it could be arbitrarily biased.
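The standard mitigation is to run the raw values through a good hash function first, so the counter sees (approximately) uniform bits regardless of the input distribution. Here's a hedged toy sketch of a LogLog-style estimator along those lines; the register count, the SHA-256 choice, and the constant are my own simplifications (real HyperLogLog adds bias and range corrections):

```python
import hashlib

def toy_loglog(values, num_registers=64):
    """Toy LogLog-style cardinality estimate. Hashing the raw values
    first is what makes the bit patterns uniform; illustration only."""
    registers = [0] * num_registers
    for v in values:
        # Hash the raw value so the counter sees uniform-looking bits.
        h = int.from_bytes(
            hashlib.sha256(str(v).encode()).digest()[:8], "big")
        idx = h % num_registers   # low bits pick a register
        rest = h // num_registers  # remaining bits feed the rank
        if rest == 0:
            rank = 59  # all remaining bits zero (vanishingly unlikely)
        else:
            # Rank = 1-indexed position of the lowest set bit.
            rank = (rest & -rest).bit_length()
        registers[idx] = max(registers[idx], rank)
    # Classic LogLog estimate: alpha * m * 2^(mean register value).
    mean = sum(registers) / num_registers
    return 0.39701 * num_registers * 2 ** mean
```

Because each register only keeps a max over hashed values, feeding in duplicates doesn't change the estimate at all, which is exactly the property that makes these sketches useful for counting distinct items.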