Bloomfilters

From time to time I read something and the term “bloomfilter” comes up. This weekend I wanted to know what actually is a bloomfilter. For a more thorough explananation go to wiki. What follows is my “laymens” explanantion.

If you have a set of values, and you want to check whether items in a second set are in the first set, you can make use of a bloomfilter.  I suppose the easiest way to understand the how and the why is I just explain what happens constructing and using the bloomfilter.

Some Characteristics o a bloomfilter:

  • Relatively small memory footprint
  • It will NEVER deliver a false negative (It will not discard a value of  the second set if it actually is in the first set)
  • It may deliver some false positives. A value that is in set 2 will might be selected although it’s not in the first set.

OK…let’s see how this works:

  1. Define the datasets
    1. The first set may look like : 2, 5 This is the  set we are going to test against.
    2. The second set may look like 2, 6, 7  (are values of 2nd in the 1st set)
  2. Constructing the Bloomfilter
    1. Define a array of bits (all zero), let’s say 10 bits width. In practice the array will be much wider. (the filter looks like : 0000000000)
    2. Define a number of hashes, let’s say 3 hashses. If a value is fed to a hash it will return a integer, representing the number of a bucket in the filter. Our filter is 10 bits width, so every hash will generate a value between 1 and 10 (including).
    3. Process the first value of the first set (2)
      1. Feed this value to hash 1. Let’s say the hash return 4. So the fourth bit of the filter will be set to 1 (0001000000)
      2. Feed this value to hash 2. Let’s say the hash return 8. So the eight bit of the filter will be set to 1 (0001000100)
      3. Feed this value to hash 3. Let’s say the hash return 5. So the fifth bit of the filter will be set to 1 (0001100100)
    4. OK…we’re done with the first value, let’s go to the second value : 5
      1. Feed this value set to hash 1. Let’s say it returns 8. So the eight bit of the bloom filter will be set to 1 (0001100100). Actuallty this bit was allready set….no problem…it stays 1
      2. Feed this value to hash 2. Let’s say it return 10. So the tenth bit of the filter will be set to 1 (0001100101).
      3. Feed this value to hash 3.  Let’s say it return 1. So the first bit of the filter is set to 1 (1001100101)
    5. Done with the second value …. in a real example there would be many many more values to feed to the filter….not here 🙂
  3. Now we want to test whether a value in the second set is in the first set.
    1. Start with the first value of the second set: 2
      1. Feed the value into the 3 hashes. The 3 hashes return 4, 8 and 5 (a hash will return the same output every time you feed it the same input)
      2. Test whether bits 4,8 and 5 are set in the bloomfilter
        1. If all 3 are set…then YES this value is (probably) in the first set
        2. If not all 3 postions in the bloomfilter are set, then this value does not belong to the first set.
        3. Bit 4, 8 and 5 are set….Yes value 2 will probably be in the first set
    2. Now the second value of the seond set : 6
      1. Feed the value 6 to the 3 hashes. Let’s say that they return 5, 6 and 9.
      2. Test whether  the bits in the bloomfilter on position 5, 6  and 9 are set.
        1. If all 3 are set…then YES this value is (probably) in the first set
        2. If not all 3 postions in the bloomfilter are set, then this value does not belong to the first set.
        3. Bit 5 is set. Bit 6 and 9 are not set. This value (6) is defenitely not be in the first set.
    3. Now the 3th value  (7)
      1. Feed the the value to the 3 hashes. Let’s say that they return the values 1, 8 and 10
      2. Test whether the bits in the bloomfilter on position 1, 8 and 10 are set.
        1. If all 3 are set…then YES this value is (probably) in the first set
        2. If not all 3 postions in the bloomfilter are set, then this value does not belong to the first set.
        3. Bit 1, 8 and 10 are set, so YES (???) the value 7 is proably in the first set. And there is a important detail of the Bloomfilter. It guarantees that it will select the values which are definetly in the first set, and it might let some values pass (false positives).

That’s it. I was thinking it is a very, very difficult process to understand, nut actually…I do understand it.

Leave a Reply

Your email address will not be published. Required fields are marked *