Probabilistic data structures store data in a compact form that lets us answer certain kinds of queries quickly, at the cost of exactness. They are called probabilistic because they are not guaranteed to give an exact answer, but they can provide an approximate answer quickly. Since each of these data structures optimizes for particular kinds of queries, each one comes with its own trade-offs. Examples of probabilistic data structures include Bloom filters, HyperLogLog, and Count-Min Sketch. In this article, we will look into Bloom filters in detail.

Bloom filters are a space-efficient probabilistic data structure used to quickly determine whether an element is possibly present in a dataset or definitely absent from it. A set of elements is encoded into a bit array of size n, initially all 0, using a set of k hash functions that map each element to positions in the array.

To add an element x, hash it with each of the k hash functions and set the resulting positions in the bit array to 1 (the k positions need not be distinct). Repeat this for every element that should be searchable. To check whether an element is present in the Bloom filter, hash it the same k times and look at those positions. If any of them is 0, the element is guaranteed not to be in the Bloom filter. Note that the reverse is not always true: if all the hashed positions are 1, the element is not necessarily present, because those bits could have been set by other elements that hash to the same indices. This is the trade-off of using a Bloom filter.
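
The add and lookup steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method names, and the choice of salting SHA-256 with the hash index to obtain k hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: n bits, k hash functions (illustrative only)."""

    def __init__(self, n: int, k: int) -> None:
        self.n = n            # size of the bit array
        self.k = k            # number of hash functions
        self.bits = [0] * n   # a list of 0/1 ints stands in for a real bit array

    def _positions(self, item: str):
        # Assumption: simulate k hash functions by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item: str) -> None:
        # Set every hashed position to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Any 0 bit proves absence; all 1s only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(n=1024, k=3)
for word in ("apple", "banana", "cherry"):
    bf.add(word)

print(bf.might_contain("apple"))   # added items are always reported as present
print(bf.might_contain("durian"))  # usually False, but a false positive is possible
```

The lookup method is deliberately named `might_contain` rather than `contains`, since a positive answer is only a "maybe".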

Applications of Bloom filters:

  1. Recommendation Engines: Medium uses Bloom filters for article recommendations. Because of false positives, the filter may occasionally skip an article that the user has not read, but it will never recommend an article that the user has already read.
  2. Saving Expensive Disk Seeks: Cassandra uses Bloom filters to determine whether an SSTable might have data for a particular partition. Databases like Google Bigtable use Bloom filters to reduce disk lookups for non-existent rows or columns.
  3. Malicious URL Detection: The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only on a hit was a more expensive full check performed.
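
The applications above share one pattern: consult the in-memory filter first, and pay for the expensive operation only when the filter says "possibly present". Below is a rough sketch of that pattern. Everything in it is hypothetical: `disk_table` is an ordinary dictionary standing in for a slow on-disk store, and `N_BITS`, `K`, `positions`, `lookup`, and `stats` are names invented for this example. It does not reflect how Cassandra, Bigtable, or Chrome implement their filters.

```python
import hashlib

N_BITS, K = 4096, 4  # hypothetical sizes chosen for the example


def positions(key: str):
    # Assumption: simulate K hash functions by salting SHA-256 with i.
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % N_BITS


# Stand-in for an on-disk table; a real store would be far slower to query.
disk_table = {"user:1": "alice", "user:2": "bob"}
stats = {"disk_reads": 0}

# Build the in-memory filter from the keys the table holds
# (one byte per position here, for simplicity, rather than one bit).
bits = bytearray(N_BITS)
for key in disk_table:
    for pos in positions(key):
        bits[pos] = 1


def lookup(key: str):
    if not all(bits[pos] for pos in positions(key)):
        return None            # definitely absent: skip the disk entirely
    stats["disk_reads"] += 1   # possibly present: pay for the real lookup
    return disk_table.get(key)


print(lookup("user:1"))    # found via the "disk" table
print(lookup("user:404"))  # None, almost always without touching the "disk"
print(stats)
```

A false positive here costs only one wasted disk read, since the real table still gives the correct answer. That is why the trade-off works well for this kind of use.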

Disadvantages of Bloom filters:

  1. False Positives: A Bloom filter can indicate that an element is present in the set when in reality it is not. It never produces a false negative, but there is no guarantee against false positives.
  2. No Deletion: Elements cannot be deleted from a Bloom filter. Since multiple elements can hash to the same position, you cannot simply reset an element's hashed positions to 0 without risking the removal of other elements as well. However, there is a related data structure called the Cuckoo filter, which supports deleting existing items.
  3. Complexity: Every insert and lookup computes k hash functions, whereas a hash table typically computes one, and choosing n and k for an acceptable false-positive rate takes some care.
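
To get a feel for how likely false positives are, the standard approximation for the false-positive probability can be evaluated directly. The sketch below keeps the article's notation (n bits, k hash functions) and introduces m for the number of inserted elements, which is an assumption of this example. The formula (1 - e^(-k*m/n))^k assumes the hash functions behave like independent, uniformly random functions. The function names are made up for illustration.

```python
import math


def false_positive_rate(n: int, k: int, m: int) -> float:
    """Approximate false-positive probability: (1 - e^(-k*m/n))^k."""
    return (1.0 - math.exp(-k * m / n)) ** k


def best_k(n: int, m: int) -> int:
    """k that roughly minimises the rate: (n / m) * ln 2, at least 1."""
    return max(1, round(n / m * math.log(2)))


n, m = 10_000, 1_000  # hypothetical: 10 bits per stored element
for k in (1, 3, best_k(n, m), 12):
    print(f"k={k:2d}  p~{false_positive_rate(n, k, m):.4f}")
```

Under this approximation, too few hash functions leave the filter imprecise, while too many fill the bit array with 1s. The rate is lowest somewhere in between.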

Resources:

Bloom filter - Wikipedia
Cuckoo filter - Wikipedia
What are Bloom filters? A tale of code, dinner, and a favour with unexpected consequences.