A decent stand-alone Java Bloom Filter implementation

Note for the impatient: If you are just looking for the code, find it here, but please leave a comment if you are using it.

They say necessity is the mother of invention, and a few days ago I needed a stand-alone Java Bloom Filter implementation. A Bloom Filter is a relatively simple way to efficiently remember whether you’ve seen something before, without actually having to store everything you’ve seen. Wikipedia can fill you in on the details.

Anyway, I did some hunting around, and the best available option seemed to be this implementation in the Apache Hadoop project.

Fortunately, I took a look at the source code before going ahead and using it. The first warning sign was that at the heart of their implementation they use an array of boolean values (boolean[]). This is a serious problem, because most Java implementations I know use a whole byte to represent each boolean in a boolean array. This means that for every actual bit stored, there are seven wasted bits – and this in a data structure that is supposed to be saving you memory!
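To put rough numbers on that (a back-of-the-envelope sketch; exact overhead varies by JVM):

// One million flags as a boolean[]: typically one byte per flag,
// so roughly 1,000,000 bytes of storage.
boolean[] flags = new boolean[1000000];

// The same million flags in a java.util.BitSet: packed eight to a
// byte, so roughly 125,000 bytes. An eight-fold saving.
java.util.BitSet bits = new java.util.BitSet(1000000);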

Some other parts of their code also concerned me, such as the way they used SHA1 as their hash function, which seemed haphazard to say the least. Oh, and the other minor problem was that in my initial testing I couldn’t actually get their implementation to work.

So, I decided to implement my own, and as luck would have it I hit on a very simple method to do it, using the java.util.Random class. The two key methods, add() and contains(), are 7 and 8 short lines of very simple readable code respectively.
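For the curious, here is a minimal sketch of the approach (illustrative only, not the actual source; the class and field names are made up): seed a java.util.Random with the element’s hashCode(), then let it pick which bits to set or test.

import java.util.BitSet;
import java.util.Random;

public class BloomSketch<E> {
	private final BitSet bitSet;
	private final int bitArraySize;
	private final int k; // number of bits set per element

	public BloomSketch(int bitArraySize, int expectedElements) {
		this.bitArraySize = bitArraySize;
		this.bitSet = new BitSet(bitArraySize);
		// near-optimal number of "hash functions" for this bits/items ratio
		this.k = (int) Math.ceil((bitArraySize / (double) expectedElements)
				* Math.log(2.0));
	}

	public boolean add(E o) {
		// the element's hashCode seeds the generator, which then
		// deterministically yields the same k positions every time
		Random r = new Random(o.hashCode());
		for (int i = 0; i < k; i++) {
			bitSet.set(r.nextInt(bitArraySize), true);
		}
		return false; // a Bloom filter cannot know whether the element was new
	}

	public boolean contains(Object o) {
		Random r = new Random(o.hashCode());
		for (int i = 0; i < k; i++) {
			if (!bitSet.get(r.nextInt(bitArraySize))) {
				return false; // definitely never added
			}
		}
		return true; // probably added (false positives are possible)
	}
}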

Of course, crypto geeks will point out that java.util.Random is not a cryptographically strong hash function, but the worst that can happen is an increased false-positive rate, and I think even this is unlikely for most applications (it’s certainly good enough for anything I’ve ever needed a Bloom Filter for).

Some other nice features:

  • Implements Java’s generic Set interface, since this should be familiar to people
  • You just tell its constructor how many bits to use and how many elements you expect to insert, and it will automatically configure itself for optimal operation
  • It will calculate its expected false-positive rate for you
  • It’s serializable
  • I’ve included a reasonably thorough unit test
  • Unlike the Hadoop implementation, SimpleBloomFilter uses BitSet, which should be nice and compact
  • I’m releasing it under a very liberal “do what you want, just remember who wrote it” license

Here is a simple example usage:

// Create a Bloom filter 100 bits in size to contain Strings, optimized to
// contain 4 items
SimpleBloomFilter<String> mammals = new SimpleBloomFilter<String>(100, 4);

// Add some things to the bloom filter
mammals.add("dog");
mammals.add("cat");
mammals.add("mouse");
mammals.add("dolphin");

// Test to see if the bloom filter remembers
if (mammals.contains("dog")) {
	System.out.println("A dog is a mammal with probability "
		+ (1 - mammals.expectedFalsePositiveProbability()));
} else {
	System.out.println("A dog is definitely not a mammal");
}

If you are wondering how many bits to use, here is a table of the false-positive rates you should expect for specific ratios between the number of items and the number of bits.

Ratio (items:bits) False-positive probability
1:1 0.63212055882856
1:2 0.39957640089373
1:4 0.14689159766038
1:8 0.02157714146322
1:16 0.00046557303372
1:32 0.00000021167340
1:64 0.00000000000004
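For the record, these figures follow from the standard Bloom filter analysis: with n items, m bits, and k hash functions, the false-positive probability is approximately (1 - e^(-kn/m))^k, minimized by choosing k near (m/n) * ln 2. For example, at a 1:8 ratio the best integer choice is k = 6, giving (1 - e^(-6/8))^6 ≈ 0.0216, the figure in the table.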

Download the source code to the SimpleBloomFilter, and a unit test here. If you would like to review the code, you can do so here with nice syntax highlighting.

Update: Rob correctly points out that the implementation of hashCode() for a number of Java classes is rather lame, including Integer, leading to higher-than-expected false-positive rates. In such circumstances, he suggests using something like this to do a better job in your own .hashCode() implementation:

static java.security.MessageDigest md5;
static {
	try {
		md5 = java.security.MessageDigest.getInstance("MD5");
	} catch (java.security.NoSuchAlgorithmException e) {
		// every Java platform is required to support MD5, so this cannot happen
		throw new AssertionError(e);
	}
}

public static int hashCode(int val) {
	int h = 0;
	// serialize the int into four little-endian bytes
	byte[] valBytes = new byte[] { (byte) val, (byte) (val >> 8),
			(byte) (val >> 16), (byte) (val >> 24) };

	// fold the first four digest bytes into a strong, 32-bit hash
	// (note: MessageDigest is not thread-safe; synchronize if shared)
	byte[] res = md5.digest(valBytes);
	for (int i = 0; i < 4; i++) {
		h <<= 8;
		h |= res[i] & 0xFF;
	}

	return h;
}

Update #2: Rob also drew my attention to another issue: in making SimpleBloomFilter Serializable, I’m assuming that an object’s .hashCode() method will always return the same value for the same object across different executions, and across different JVMs on different platforms. The Javadoc for Object.hashCode() explicitly makes no such commitment. If this is a concern for you, I recommend providing your own .hashCode() implementation, perhaps based on the code above, which does meet this criterion.
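For instance, a platform-stable hash for Strings might look something like this (a sketch along the lines of the code above; stableHashCode is a hypothetical helper, not part of SimpleBloomFilter):

public static int stableHashCode(String s) {
	try {
		java.security.MessageDigest md =
				java.security.MessageDigest.getInstance("MD5");
		// UTF-8 bytes and MD5 digests are identical on every platform and JVM
		byte[] res = md.digest(s.getBytes("UTF-8"));
		int h = 0;
		for (int i = 0; i < 4; i++) {
			h <<= 8;
			h |= res[i] & 0xFF;
		}
		return h;
	} catch (java.security.NoSuchAlgorithmException e) {
		throw new AssertionError(e); // MD5 is required on all Java platforms
	} catch (java.io.UnsupportedEncodingException e) {
		throw new AssertionError(e); // UTF-8 is required on all Java platforms
	}
}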

If someone cares to modify my code to address these shortcomings, and release them under a similarly liberal license, I’d be happy to link to their code.


15 Responses to “A decent stand-alone Java Bloom Filter implementation”

  1. mjimeno Says:

    hello Ian, I’m interested in using your implementation, and I promise I will give full credit =D, but… I recommend you publish a “run my code for dummies” guide so that people wanting to use it (like me) don’t get bored trying to work out how. I hope you accept this as a friendly recommendation. So, could you please post a step-by-step on how to run your code? Thanks!
    Miguel

  2. ian Says:

    Hey Miguel,

    The unit test file should provide a good example of how to use the class, but I will add a quick example usage above too.

  3. laz Says:

    Nice work, your implementation looks good. My only recommendation would be to use
    java.lang.UnsupportedOperationException in place of your custom exception type. In general it is a best practice to reuse appropriate exception types already defined as part of the Java API. See
    http://java.sun.com/j2se/1.5.0/docs/api/java/lang/UnsupportedOperationException.html

    Cheers,
    Larry

  4. ian Says:

    Thanks laz, you are right about UnsupportedOperationException; I can never remember what it’s called, thanks for reminding me. I’ve made the change.

  5. Inigo Says:

    Hi Ian,

    Thanks for your Java implementation. I’ve written up a conversion of it to F# on my company blog at:

    http://67bricks.com/blog/2008/03/19/bloom-filter-implementation-in-f/

    (F# is an interesting language – in many ways, it’s the .NET equivalent of Scala)

  6. ian Says:

    Hi Inigo, that is cool. I’ve heard of F#; I understand it’s similar to ML. ML was developed at the University of Edinburgh, where I studied Comp Sci, and much of our coursework used it, so I grew pretty familiar with it, although I’m a bit rusty these days. Scala is refreshing my memory about functional programming, though.

  7. Rob Says:

    A few comments on this…

    It should really not be declared as a Java Set. The Java collection interfaces have some unfortunate design decisions, and the ones made for Sets, Maps, and Lists limit the valid implementations. A Set must implement particular, exact semantics for equals() and hashCode(), and existing code really does depend on this. The essential lack of an iterator and other operations means that maybe this should not be viewed as a Set implementation, anyway.

    Statistically, the use of the sequence output from the Random generator is problematic. It results in a hash no more unique than the seed, since the sequence is fully determined by that.

    Using the hashcode is itself a problem. It should be made clear that this is only a set of hashcodes of instances. Hashcodes are notoriously bad and not at all uniformly distributed, including for strings and primitive types. Many types have weak hashcodes with frequent collisions, which the contract allows. A very common pattern is a mutable type whose instances have stable hashcodes based on some part of their data, but whose equality depends on their current state. All hashcode collisions are indistinguishable in this type, which is again non-Set-like.

    The test code does not explore the hashCode problems, because it uses uniformly-generated integers, and the hashCode of an integer is itself. This happens to also avoid some of the other statistical problems, as well.

    SHA1 is a reasonable hash function to use, since it is commonly relatively fast. MD5 is reasonable, too.

    It might seem arbitrary to use a cryptographic hash function, and the choice among such functions isn’t very important for this kind of usage, but the difference they make is very real. For example, try taking the test code and generating integers between 0 and 200000. The calculations should still be right, but the prediction is clearly missed. However, even a simple hash fixes this:

    public static int hashCode(int val) {
        int h = 0;
        // create a strong, 32-bit hash:
        try {
            byte[] valBytes = new byte[] { (byte) val, (byte) (val >> 8),
                    (byte) (val >> 16), (byte) (val >> 24) };
            java.security.MessageDigest md5 =
                    java.security.MessageDigest.getInstance("MD5");
            byte[] res = md5.digest(valBytes);
            for (int i = 0; i < 4; i++) {
                h <<= 8;
                h |= ((int) res[i]) & 0xFF;
            }
        } catch (java.security.NoSuchAlgorithmException ex) {
            // cannot happen
        }
        return h;
    }

    And, of course, you can now use a 64-bit hash instead.

    It is also questionable whether this type should be marked Serializable – if so, it should have clearly denoted limitations. The problem occurs because references to the objects that are added are not retained. A hashcode may be valid only for the current execution. (One example of this would be an interned type, where only one instance exists for each unique value; the hashcode may be left at the creation count supplied by Object.) The serialization would normally cover such fixups, as well as others, but it cannot in this case.

    I don’t know if it has changed, but the Hadoop version does actually use a BitSet.

  8. ian Says:

    > The essential lack of an iterator and other operations means
    > that maybe this should not be viewed as a Set implementation,
    > anyway.

    Obviously there are a lot of Set operations it doesn’t support, but I don’t see the harm in sticking to a familiar interface for those operations that are supported. The operations that aren’t supported are clearly documented in the Javadoc.

    > Statistically, the use of the sequence output from the Random
    > generator is problematic. It results in a hash no more unique
    > than the seed, since the sequence is fully determined by that.

    Of course, a hash is never more unique than the thing that is hashed, whether you use MD5, SHA1, or any other hash function. Why would that be problematic in this application?

    > It should be made clear that this is only a set of hashcodes of
    > instances.

    Right, it is true many .hashCode() implementations are lame. I will make a note of this, and suggest that users employ something stronger.

    > It is also questionable whether this type should be marked
    > Serializable – if so, it should have clearly denoted
    > limitations.

    I believe the limitations are clearly denoted, it is called a “Bloom Filter” after all, and the Javadoc for the class clearly states “Only the add(), addAll(), contains(), and containsAll() methods are implemented”. I’m not sure how I could be clearer than that.

    > I don’t know if it has changed, but the Hadoop version does
    > actually use a BitSet.

    It does now, I believe they changed it after I reported the problem to them.

  9. Rob Says:

    > Obviously there are a lot of Set operations it doesn’t support,
    > but I don’t see the harm in sticking to a familiar interface for
    > those operations that are supported. The operations that aren’t
    > supported are clearly documented in the Javadoc.

    But that’s the thing – you don’t get that choice, in the JDK. Sometimes, operations are described as ‘optional’, with specified behavior (commonly exceptions) when they are not supported. (Skipping remove() is fine, as long as you throw the specified UnsupportedOperationException). Java strictly specifies the behavior that any Set must exhibit, and this is in violation. That can break *any* code that uses Sets in unspecified ways, which limits componentized use of the instances. It is quite dangerous, since an unaware user may have mysterious bugs based on the methods they use your instance with, not on their own code.

    It’s a contract issue. If you don’t obey the specified contracts, no one using you can rely on the specified contracts.
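    Concretely, something like the following compiles cleanly but fails at runtime (a sketch, assuming unsupported operations throw UnsupportedOperationException, as the Javadoc describes):

    java.util.Set<String> s = new SimpleBloomFilter<String>(100, 4);
    s.add("dog");        // supported, works as documented
    s.contains("dog");   // supported, works as documented

    // Generic Set-consuming code has no idea it holds a Bloom filter:
    java.util.Set<String> view = java.util.Collections.unmodifiableSet(s);
    for (String item : view) {
        // iterator() is unsupported, so the loop throws at runtime
        System.out.println(item);
    }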

    > > Statistically, the use of the sequence output from the Random
    > > generator is problematic. It results in a hash no more unique
    > > than the seed, since the sequence is fully determined by that.
    >
    > Of course, a hash is never more unique than the thing that is
    > hashed, whether you use MD5, SHA1, or any other hash function.
    > Why would that be problematic in this application?

    Because you are effectively running the instance space through a funnel. You are effectively compressing the whole space down to 4 bytes, before expanding that up to perhaps 100 bytes. That 100-byte space is sparsely filled. Directly mapping the original data can produce a dense mapping to the full 100 bytes.

    The use of a weak pseudorandom generator to simulate multiple hash functions slightly compounds the problem further. Any time you hit on the same state in a sequence, you will overlap entirely in the generated sequence for the remainder of the run.
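    To see the funnel concretely (a small sketch):

    // Any two elements whose hashCodes collide seed identical Random
    // sequences, so they set exactly the same bits in the filter and
    // are permanently indistinguishable from one another:
    java.util.Random a = new java.util.Random(42);
    java.util.Random b = new java.util.Random(42);
    for (int i = 0; i < 5; i++) {
        // prints the same pair of numbers on every iteration
        System.out.println(a.nextInt(100) + " == " + b.nextInt(100));
    }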

    > > It is also questionable whether this type should be marked
    > > Serializable – if so, it should have clearly denoted
    > > limitations.
    >
    > I believe the limitations are clearly denoted, it is called
    > a “Bloom Filter” after all, and the Javadoc for the class clearly
    > states “Only the add(), addAll(), contains(), and containsAll()
    > methods are implemented”. I’m not sure how I could be clearer
    > than that.

    I don’t think I was clear here. This is not a property of Bloom filters, or of the interface contract violations. (On that point, to be clearer… Well, don’t declare that it implements Set.)

    The Serializable problem is due to the fact that Java serialization preserves reference equality and patches up various aspects that might change at runtime. Hashcode methods and Random generators may change across platforms, minor versions, or implementations, and relying on that behavior in a serialized representation is not desirable. This is why even the serialized form of Hashtable does not rely on hashcodes.

    As I mentioned above, hashcodes in particular may be stable only within a single execution. In fact, the Java specification is quite clear about this contract: In Object.hashCode(), it mentions “This integer need not remain consistent from one execution of an application to another execution of the same application.”

    It’s nice to have a simple example, and the complications don’t matter for all uses. I also actually don’t like the way Set is defined, and it is unfortunate that the weaknesses of the JDK collections make it hard to find a place for this.

  10. ian Says:

    Rob, I understand the various points you are making.

    On the Set issue, I do agree that there is some inelegance in implementing Set without fully meeting its specification; however, the contract for SimpleBloomFilter is accurately specified in the Javadoc for SimpleBloomFilter, and I think most people will take this to be authoritative. I think the benefits of sticking to a familiar interface outweigh the potential confusion. We may just need to agree to disagree on this.

    I do agree that reliance on .hashCode(), and 4 byte hashes, and Random() will increase false-positive rates, but my goal here was to create a simple and fast BloomFilter implementation, not a cryptographically optimal one (hence the name).

    The instability of the hashCode() methods is definitely a valid point, although not a problem in the specific application for which I originally wrote this code. Nonetheless, I will add a note to the article above to warn people about this.

  11. Rob Says:

    Well, not much more to add, then…

    I suppose you could make it specifically just a set of integers, with convenience methods that contains(Object) or add(Object) by taking the hashCode() and passing it through. That would make the behavior very clear… You could leave it generic to allow only a certain type for the convenience call, but with no observer methods there is little benefit.

    To beat the pettiest aspect to death… I do still have trouble seeing your Set argument. What is familiar to people is the names of add() and contains(), and you can always retain that. People using this type, though, cannot pass it to other methods, such as JDK methods like Collections.unmodifiableSet(), and they cannot even pass it to methods created by their own group. Any such abstracted use is just asking for trouble. As such, if they are only using it in their own internal type implementation or code fragments, isn’t it enough to have the methods of the specified names? They don’t need an interface. (The only marginal case where it would be used would be if a user had a collection of Sets, or a variable that might be this type or a Set. I don’t see that as a major use case, and it still leaves usage dangerous.)

    I guess now I can just agree to disagree…

  12. Miguel Jimeno Says:

    Part 1 of 2

    Hello Ian,

    Thank you for writing and making available the Bloom filter (bf) class. In your bf implementation you take one hash of the element to be mapped (or tested) to seed an RNG, which then generates k bit locations in the bf. This approach increases the probability of false positive to:

    p_new = (p * 1) + ((1 - p) * p_false)

    where p_false is the well known “theory” probability of false positive for a bf and p is the probability of a test string that is not ‘stored’ in the bf having the same seed of the RNG as a string that is actually stored. If there are n elements stored in a bf and a hash value is 32-bits, then the probability of a random element having the same first hash value as one of the stored elements is roughly (we assume all stored strings have different first hash value, which is somewhat optimistic):

    p = n / (2^32)

    Obviously, if p << p_false, then p_new will be about the same as p_false. However, this may not always be the case. Consider a bf with m/n = 32, k = 22, and 1 million elements stored. For such a bf, p_false = 2.1×10^-7 (see the Wikipedia bf entry for the theory behind this calculation). Computing k bit locations using an RNG gives p = 10^6 / 2^32 ≈ 2.3×10^-4, so p_new ≈ 2.3×10^-4. Or, an increase in the probability of false positive of about 1000 times! This is not trivial!

    (Continuation in following post)

  13. Miguel Jimeno Says:

    Part 2 of 2

    The best way to correct this problem is to compute k independent hashes for each element stored. The add() method can be rewritten as:

    public boolean add(E o) {
        for (int x = 0; x < k; x++) {
            // appending the counter x to the element's string form derives k
            // independent hashes from the same hash function; Math.abs guards
            // against negative hash values
            int randPos = (int) (Math.abs(hashCode(o.toString()
                    + Integer.toString(x))) % bitArraySize);
            bitSet.set(randPos, true);
        }
        return false;
    }

    where hashCode() is the MD5 function proposed by one of the readers of your blog (I modified it to return a long), and where the same string with the counter x appended at the end generates k different hashes from the same hashCode() function.
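    A sketch of such a long-returning variant:

    public static long hashCode(String s) {
        try {
            java.security.MessageDigest md5 =
                    java.security.MessageDigest.getInstance("MD5");
            byte[] res = md5.digest(s.getBytes("UTF-8"));
            long h = 0;
            // fold eight digest bytes into a long instead of four into an int
            for (int i = 0; i < 8; i++) {
                h <<= 8;
                h |= res[i] & 0xFF;
            }
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e); // MD5 and UTF-8 are always available
        }
    }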

    My experiments show:

    1) The current implementation does produce a false-positive rate that matches the above analysis.

    2) The new method brings the probability of false positive very near theory, and requires about 2.5x more CPU time to map or test an element.

    A middle ground between CPU time and false positives might be to use a combination of MD5 hashes (fewer than k) and numbers from an RNG sequence. This would raise the false-positive rate a little but consume less CPU time. Please take a look at this; it would be wonderful if it added something to the discussion.

    Regards,
    Miguel Jimeno

  14. arwilliamson Says:

    Ian, what a wonderful blog entry. I got referred here by an Oracle representative, of all places!

    One question I do have: I have not yet played with your code, but have you tested it with very large numbers?

    I am looking to track around 20,000,000 items and was wondering if it works well at that scale.

  15. iancjclarke Says:

    arwilliamson: If I were doing it with that many items I’d probably convert it to use longs rather than ints.

