Jun's Alcove

Index Journal Projects Exhibits About

Bloom Filters are Neat

I came across the bloom filter data structure recently, and decided to write my own implementation of the concept^{^source}. It is a very cool way to reduce the size requirements of certain operations, at the cost of a configurable false-positive rate. More of the math and historical details can be found on elsewhere^{^wikipedia}.

The idea of reducing a string or byte-array value into a collection of toggled bits or counted integers, where the index is calculated by hashing the input and then iteratively hashing the results of the previous iteration until you have achieved the accuracy characteristics you're looking for, is fascinating. The tradeoffs involved are immediately clear, and the use cases are self-evident once accepted.

^{^source} https://codeberg.org/juniper/odin-bloom-filter

^{^wikipedia} https://en.wikipedia.org/wiki/Bloom_filter

Why Would You Want False Positives?

I think it's fair to say nobody ever really wants false positives, but there are plenty of use cases where they are perfectly acceptable. For example, what if you want to check if a file hash matches a hash of a known piece of malware (ignoring that this approach of malware detection is well known to be bad). In that case, storing every hash of every known malware can exhaust space very quickly. If we allow false positives and allow the end user to handle such cases, you can drastically reduce the amount of space required by using a bloom filter and accepting a small amount of error (usually <1%, but it is configurable).

So What is a Bloom Filter?

A bloom filter, in its simplest form, is an array of bits (not bytes!) that are toggled on based on a specified hash function. This style of bloom filter can only be added too, and checked for values using the same hash function.

If for any reason you need to remove values from a bloom filter, a counting bloom filter can be used! This has worse size characteristics by replacing the bit array with an array of counters (of an arbitrary bit width). This means that when adding to the bloom filter counters are incremented instead of bits set, and then on removal those counters are decremented.

I'm sure there are other styles of bloom filter, as well as ones that integrate with other data structures and services. But those are the two I consider fundamental.

Some Esoteric Use Case Thoughts

This is where I get to be unhinged, and think of some interesting use cases that may not be technically sound.

One interesting use case I can imagine is for "memory simulation" for NPCs in a video game, or other simulation. Storing complex lookups can be extremely exhaustive not only memory-wise, but also processing-wise. So perhaps you could instead use bloom filters as a "first try" lookup, where when you ask the NPC something they use the bloom filter as a preliminary check to see if they may know what you're talking about. This has the added benefit of error, something all to commonplace in our real life memory.

Another use case may be as a root data structure for some esoteric programming languages memory model. While deterministic, the chaotic nature of a bloom filter as a primary storage mechanism for all data in a programming language sounds like a very fun logical challenge. Particularly if a high false positive rate is used.

Extending upon the first use case, bloom filters would also be perfect for information decay in a strategic video game. For example, when attempting to gauge where enemy battalions may be on a map a bloom filter would provide false positives that would exist in a real life analog, where the fog of war limits what can be relayed to command.

Conclusion

Bloom filters are neat. They provide a great source of chaos via false positives while also being extremely useful in real life scenarios, and are something worth being aware of.