From time to time I read something and the term “bloomfilter” comes up. This weekend I wanted to know what actually is a bloomfilter. For a more thorough explananation go to wiki. What follows is my “laymens” explanantion.
If you have a set of values, and you want to check whether items in a second set are in the first set, you can make use of a bloomfilter. I suppose the easiest way to understand the how and the why is I just explain what happens constructing and using the bloomfilter.
Some Characteristics o a bloomfilter:
- Relatively small memory footprint
- It will NEVER deliver a false negative (It will not discard a value of the second set if it actually is in the first set)
- It may deliver some false positives. A value that is in set 2 will might be selected although it’s not in the first set.
OK…let’s see how this works:
- Define the datasets
- The first set may look like : 2, 5 This is the set we are going to test against.
- The second set may look like 2, 6, 7 (are values of 2nd in the 1st set)
- Constructing the Bloomfilter
- Define a array of bits (all zero), let’s say 10 bits width. In practice the array will be much wider. (the filter looks like : 0000000000)
- Define a number of hashes, let’s say 3 hashses. If a value is fed to a hash it will return a integer, representing the number of a bucket in the filter. Our filter is 10 bits width, so every hash will generate a value between 1 and 10 (including).
- Process the first value of the first set (2)
- Feed this value to hash 1. Let’s say the hash return 4. So the fourth bit of the filter will be set to 1 (0001000000)
- Feed this value to hash 2. Let’s say the hash return 8. So the eight bit of the filter will be set to 1 (0001000100)
- Feed this value to hash 3. Let’s say the hash return 5. So the fifth bit of the filter will be set to 1 (0001100100)
- OK…we’re done with the first value, let’s go to the second value : 5
- Feed this value set to hash 1. Let’s say it returns 8. So the eight bit of the bloom filter will be set to 1 (0001100100). Actuallty this bit was allready set….no problem…it stays 1
- Feed this value to hash 2. Let’s say it return 10. So the tenth bit of the filter will be set to 1 (0001100101).
- Feed this value to hash 3. Let’s say it return 1. So the first bit of the filter is set to 1 (1001100101)
- Done with the second value …. in a real example there would be many many more values to feed to the filter….not here 🙂
- Now we want to test whether a value in the second set is in the first set.
- Start with the first value of the second set: 2
- Feed the value into the 3 hashes. The 3 hashes return 4, 8 and 5 (a hash will return the same output every time you feed it the same input)
- Test whether bits 4,8 and 5 are set in the bloomfilter
- If all 3 are set…then YES this value is (probably) in the first set
- If not all 3 postions in the bloomfilter are set, then this value does not belong to the first set.
- Bit 4, 8 and 5 are set….Yes value 2 will probably be in the first set
- Now the second value of the seond set : 6
- Feed the value 6 to the 3 hashes. Let’s say that they return 5, 6 and 9.
- Test whether the bits in the bloomfilter on position 5, 6 and 9 are set.
- If all 3 are set…then YES this value is (probably) in the first set
- If not all 3 postions in the bloomfilter are set, then this value does not belong to the first set.
- Bit 5 is set. Bit 6 and 9 are not set. This value (6) is defenitely not be in the first set.
- Now the 3th value (7)
- Feed the the value to the 3 hashes. Let’s say that they return the values 1, 8 and 10
- Test whether the bits in the bloomfilter on position 1, 8 and 10 are set.
- If all 3 are set…then YES this value is (probably) in the first set
- If not all 3 postions in the bloomfilter are set, then this value does not belong to the first set.
- Bit 1, 8 and 10 are set, so YES (???) the value 7 is proably in the first set. And there is a important detail of the Bloomfilter. It guarantees that it will select the values which are definetly in the first set, and it might let some values pass (false positives).
- Start with the first value of the second set: 2
That’s it. I was thinking it is a very, very difficult process to understand, nut actually…I do understand it.