Commit 13cc12f: Create bloom_filters.md
Author: Joseph Luce
1 parent eea2efa

1 file changed: +38 -0 lines changed

system_design/bloom_filters.md

@@ -0,0 +1,38 @@
# Bloom Filters

## Use Case
Given a word, determine whether it has already been seen.

At the system design level, a hash table can work.
However, with a very large number of words, say a billion or more, you start running into performance issues.
The table no longer fits in memory, so lookups incur disk I/O and storage overhead.
You could optimize, for example by sharding the data into buckets with sub-hash tables, but this does not fully solve the latency issue.

This is where Bloom filters come in; they are popular in databases.
Imagine an API like check(word) that returns True or False.
The API is probabilistic: a False is always correct, while a True is only probably correct. It might be right about 90% of the time, depending on how the filter is sized.
The advantage is that a Bloom filter uses far less memory than the hash table approach.

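To put rough numbers on the memory claim, here is a minimal sketch using the standard Bloom filter sizing approximations: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n elements and a target false-positive rate p. The function name `bloom_size` and the 10% target are assumptions chosen to match the example in these notes.

```python
import math

def bloom_size(n, p):
    """Return (bits, hash_count) for n elements and target false-positive rate p.

    Uses the standard approximations; real libraries may round differently.
    """
    bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / n * math.log(2)))
    return bits, hash_count

# Assumed workload: one billion words, 10% false positives allowed.
bits, hash_count = bloom_size(1_000_000_000, 0.10)
print(f"{bits / 8 / 1e6:.0f} MB, {hash_count} hash functions")
```

Under these assumptions the filter needs roughly 600 MB, under 5 bits per word, whereas a hash table also has to store the words themselves.
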
## How it works
1. Start with a bit array of a fixed size, say 8 bits: 00000000.
2. Given a word, "cat", run it through multiple hash functions; each hash function outputs an index.
For example, two hash functions hash1('cat') and hash2('cat') give us indexes 2 and 5 (counting from 0 on the left).
We then set the bits at those indexes, giving 00100100.
3. Given another word, "dog", run it through the same hash functions, giving indexes 7 and 2.
Again, set the bits accordingly, giving 00100101.
4. To check whether the word "bird" exists, run it through the hash functions; suppose they return indexes 5 and 1.
Since the bit at index 1 isn't set, we know "bird" was never added.
5. Similarly, if we tried another word, like "lion", and the hash functions returned 2 and 7, the API would report that "lion" exists even though we never saved it.

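The steps above can be sketched in a few lines of Python. This is an illustration, not a production implementation: the class name `BloomFilter`, the use of salted SHA-256 to simulate several hash functions, and the sizes are all assumptions, so the actual indexes will differ from the 2/5/7 used in the walkthrough.

```python
import hashlib

class BloomFilter:
    def __init__(self, size, hash_count):
        self.size = size              # number of bits in the array
        self.hash_count = hash_count  # number of hash functions
        self.bits = [0] * size

    def _indexes(self, word):
        # Simulate hash_count hash functions by salting SHA-256 with a counter.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for index in self._indexes(word):
            self.bits[index] = 1

    def check(self, word):
        # True only if every index for this word is set.
        return all(self.bits[index] for index in self._indexes(word))

bf = BloomFilter(size=1024, hash_count=3)
bf.add("cat")
bf.add("dog")
print(bf.check("cat"))   # True: added words are always found
print(bf.check("bird"))  # usually False; a True here would be a false positive
```
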
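This is why a Bloom filter never gives a false negative (a "does not exist" answer is always right) but can give a false positive when it says a word exists.
To reduce false positives, Bloom filters use several hash functions, so a word that was never added is more likely to hit at least one zero bit. There is a trade-off: every extra hash function also sets more bits on each insert, so too many of them fill the array and raise the false-positive rate again.
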
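The effect of the number of hash functions can be estimated with the usual approximation p = (1 - e^(-kn/m))^k, where m is the number of bits, n the number of stored elements, and k the number of hash functions. The sketch below is only that approximation evaluated for an assumed array of 8 bits per element; the function name `false_positive_rate` is illustrative.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false-positive rate for m bits, n elements, k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

m, n = 8_000, 1_000  # assumed sizing: 8 bits per element
for k in (1, 2, 4, 6, 8, 12, 16):
    print(k, round(false_positive_rate(m, n, k), 4))
```

With this sizing the estimate falls from roughly 12% at one hash function to about 2% near six, then rises again as more hash functions fill the array.
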
Lastly, since a Bloom filter is just a bit array, it can be stored compactly: pack the bits into bytes (8 bits each) and, if a text form is needed, encode those bytes as a string.
This results in something like A90bhl158, which represents all the set bits in a condensed form.

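As a small illustration of storing the bit array compactly, the sketch below packs the 00100101 array from the walkthrough into bytes and prints it as text. The helper name `pack_bits` and the choice of hex and Base64 as text encodings are assumptions; the notes do not say which encoding produces a string like A90bhl158.

```python
import base64

def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte, leftmost bit first."""
    padded = bits + [0] * (-len(bits) % 8)  # pad up to a whole number of bytes
    return bytes(
        int("".join(map(str, padded[i:i + 8])), 2)
        for i in range(0, len(padded), 8)
    )

bits = [0, 0, 1, 0, 0, 1, 0, 1]  # the 00100101 array from the walkthrough
packed = pack_bits(bits)
print(packed.hex())                       # 25
print(base64.b64encode(packed).decode())  # JQ==
```
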
## Limitations
Bloom filters require a rough estimate of how many unique elements will be stored, because the size of the bit array must be chosen beforehand.
Once the bit array is sized, it is hard to change: the original words are not stored, so they cannot be rehashed into a larger array.
Similarly, once an element has been added to the bit array, it can never be removed, because clearing its bits could also clear bits shared with other elements.
However, there is a variant called an invertible Bloom filter that supports removing elements.
I won't be discussing this topic here as it shouldn't be needed for interviews.
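
To see why removal is unsafe in a plain Bloom filter, here is a small sketch that reuses the indexes from the walkthrough ("cat" at 2 and 5, "dog" at 7 and 2). The hash outputs are hard-coded for illustration, and `naive_remove` is a deliberately broken, hypothetical helper.

```python
# Hash outputs taken from the walkthrough; a real filter would compute them.
INDEXES = {"cat": (2, 5), "dog": (7, 2)}

bits = [0] * 8

def add(word):
    for i in INDEXES[word]:
        bits[i] = 1

def check(word):
    return all(bits[i] for i in INDEXES[word])

def naive_remove(word):
    # Broken on purpose: clears bits that other words may share.
    for i in INDEXES[word]:
        bits[i] = 0

add("cat")
add("dog")
naive_remove("cat")
print(check("dog"))  # False: "dog" was never removed, but shared bit 2 is gone
```

Clearing the bits for "cat" also clears bit 2, which "dog" depends on, so the filter now gives a false negative, the one kind of error a Bloom filter is supposed to rule out.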
