Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124

When you’re searching petabytes of data, you can’t afford to scan every record or even every file. You need a quick way to answer the question:
“Is this key possibly in my dataset?”
Bloom filters provide exactly that, with very fast lookups and a small memory footprint, by accepting a small false positive rate in exchange for large performance gains.
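A Bloom filter is a probabilistic data structure that answers set-membership queries with either "definitely not present" or "possibly present", using far less memory than storing the keys themselves.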
Key point: Bloom filters are not about exact lookup; they are about fast elimination.
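- Start with a bit array of size m with all zeros.
- Choose k independent hash functions, each mapping an input to one of m positions.
- To add an element, hash it with each function and set the resulting bits to 1.
- To query an element, hash it the same way and check the k bit positions.
- If any bit is 0, the element is definitely not in the set.
- If all bits are 1, the element is possibly in the set.

Let’s create a Bloom filter for a set of fruits: {apple, banana, cherry}.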
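- m = 10 bits
- k = 3 hash functions

To insert apple, hash it with each of the 3 functions and set the resulting bit positions to 1. Repeat for banana and cherry.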
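The fruit example can be sketched in Python. This is a minimal illustration, not a production implementation: the BloomFilter class and its method names are hypothetical, and the k hash functions are simulated by salting SHA-256 with an index, which is an assumption made here for simplicity.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array of size m and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # start with all zeros

    def _positions(self, item):
        # Assumption: simulate k hash functions by salting SHA-256 with i.
        positions = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, item):
        # Set the bit at each of the k hashed positions.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Any 0 bit: definitely not present. All 1 bits: possibly present.
        return all(self.bits[pos] for pos in self._positions(item))


fruits = BloomFilter(m=10, k=3)
for fruit in ["apple", "banana", "cherry"]:
    fruits.add(fruit)

print(fruits.bits)
print(fruits.might_contain("apple"))   # True: inserted items are never missed
print(fruits.might_contain("durian"))  # may be True or False with only 10 bits
```

With only 10 bits and up to 9 of them set, false positives are likely in this toy setup. Real filters use far larger bit arrays relative to the number of elements.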
Estimate the number of elements n in advance so that you can size m and k correctly.

When you create a Bloom filter index on a Delta table column:
    CREATE BLOOMFILTER INDEX ON TABLE orders FOR COLUMNS (customer_id) OPTIONS ('fpp' = 0.01);

fpp = false positive probability target.

For a point lookup such as:

    SELECT * FROM orders WHERE customer_id = 12345;

The Bloom filter tells the engine which Parquet files might have matching rows, so it can skip scanning the rest.
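To make the link between a target fpp, the element count n, and the parameters m and k concrete, here is a small sketch using the standard textbook approximations m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The function names are hypothetical, and this is not a description of how Delta sizes its index internally.

```python
import math


def size_bloom_filter(n, fpp):
    """Return (m, k) for n expected items and a target false positive probability."""
    # Textbook approximation for the number of bits.
    m = math.ceil(-n * math.log(fpp) / (math.log(2) ** 2))
    # Near-optimal number of hash functions, rounded to a whole number.
    k = max(1, round(m / n * math.log(2)))
    return m, k


def expected_fpp(m, k, n):
    """Approximate false positive rate after inserting n items."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = size_bloom_filter(n=1_000_000, fpp=0.01)
print(m, k)
print(expected_fpp(m, k, 1_000_000))
```

Under these approximations, one million items at fpp = 0.01 need roughly 9.6 bits per item and 7 hash functions. By the same formula, the 10-bit, 3-hash fruit example holding 3 items has an expected false positive rate of about 0.21, which is why small filters are only useful for illustration.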
Bloom filters are not about answering “what’s in my dataset” — they’re about quickly ruling out what’s not.
They are one of the most effective tools for reducing I/O in big data systems, especially when combined with columnar storage and metadata pruning.