Bloom Filters have become a standard in production systems that process larger datasets. Their widespread use , espically in distributed databases , comes from the effectiveness that they promise in use cases in which we need a hash table functionality for fast lookups and inserts, but do not have the luxury of space.They were invented in 1970s by Burton Bloom , but it actually "bloomed" in the last decade when need for space efficient systems increased, thanks to the boom of big datasets.

One simple way to think about Bloom filters is that they support insert and lookup in the same way the hash tables do, but using very little space, i.e., 1 byte per item or less. This is a significant saving when keys take up 4-8 bytes. Bloom Filers do not store the item themselves, and so it can load up efficiency on RAM , but, therefore also exhibits an small error rate which is bunch of false positves.

Note that bloom filters have false positives but they do not have false negatives.

So ,simply spoken, when the Bloom Filter reports that an item is "Present" , there is a small chance that item may not be present in the data structure object, but when it reports the item as "Not Present/Not Found" , it's definitely not present. So , in situations where the query answer is expected to be Not Present most of the time, Bloom filters offer great accuracy plus space-saving benefits.

How are Bloom Filters being used in Google's WebTable & Apache Cassandra

Google's WebTable and Apache Cassandra are among the most widely used distributed storage systems for handing massive data. So, it will be real interesting to get a sneak into how bloom filters are being utilized by these systems.

Namely, these systems organize their data into a number of tables called Sorted String Tables (SSTs) that reside on disk and are structured as key-value maps (a key might be a URL, and a value might be website attributes or contents, for example.). Cassandra and WebTable simultaneously handle adding new content into tables and answering queries, and when a query arrives, it is important to locate the SST containing the queried content without explicitly querying each table. So , to do that , dedicated Bloom filters in RAM are maintained, one per SST, to route the query to the correct table.

So, suppose if we are given 50 SSTs stored in disc which has one to one mapped 50 bloom filters loaded in RAM, for one SST as soon as the bloom filter reports the item as "Not Found" on a lookup, the query is routed to next bloom filter associated to next SST.

And , the first time a Bloom filter reports the item as "Present" in one SST, we go to disk to check whether the item is present in the table. In case of a false postive lookup, we continue routing the query until again a Bloom filter reports Present, and this time the data is also found on disk. If this case comes , data is returned back to the user, else the routing continues till the SST's lasts.

Similarly, Bloom filters are most useful when placed strategically in high-ingestion systems. For example, having an application perform an SSD/disk read/write can easily bring down the throughput of an application from hundreds of thousands ops/sec to only a couple of thousands or even a couple of hundreds ops/sec. Instead, if we place a Bloom filter in RAM to serve the lookups, we can avoid this performance slump, and maintain consistently high throughput across different components of an application. 


Massive DataSet Algorithms Applications