Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !

From: Avinash Kumar <avinash(dot)vallarapu(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !
Date: 2019-06-08 02:43:08
Message-ID: CAN0TujeD-DxQ=TjdQkEPXdVSxtt6MpAhJ=TazsAHxuRQ4_E5qQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,

I was testing bloom indexes today. I understand bloom indexes uses bloom
filters.

As i understand, a bloom filter is a bit array of m bits and a constant "k"
number of hash functions are used to generate hashes for the data. And then
the appropriate bits are set to 1.

I was doing the following test where i was generating 10 million records
and testing bloom indexes.

CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5
int,id6 int,id7 int,details text, zipcode int);

INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() *
1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random()
* 1000000)::int,(random() * 1000000)::int, (random() *
1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()*
(20000-9999 + 1) + 9999) from generate_series(1,100*1e6) g;

As per the documentation, bloom indexes accepts 2 parameters. *length* and
the *number of bits for each column*.

Here is the problem or the question i have.

I have created the following 2 Indexes.

*Index 1*
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4,
col8=4);

*Index 2*
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=2, col2=2, col3=2, col4=2, col5=2, col6=2, col7=2,
col8=2);

With change in length, we of course see a difference in the Index size
which is understandable. Here i have the same length for both indexes. But,
i reduced the number of bits per each index column from 4 to 2. Both the
above indexes are of the same size. But, there is a very slight difference
in the execution time between Index 1 and Index 2 but with the same cost.

So the question here is -
I assume - number of bits = k. Where k is the total number of hash
functions used on top of the data that needs to validated. Is that correct
? If yes, why do we see the Index 1 performing better than Index 2 ?
Because, the data has to go through more hash functions (4 vs 2) in Index 1
than Index 2. So, with Index 1 it should take more time.
Also, both the indexes have ZERO false positives.
Please let me know if there is anything simple that i am missing here.

*Query *

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) select * from foo.bar where id = 736833
and dept = 89861 and id2 = 573221 and id3 = 675911 and id4 = 943394 and id5
= 326756 and id6 = 597560 and zipcode = 10545;

*With Index 1 *

QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo.bar (cost=2647060.00..2647061.03 rows=1 width=69)
(actual time=307.000..307.000 rows=0 loops=1)
Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 =
573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
-> Bitmap Index Scan on idx_bloom_bar (cost=0.00..2647060.00 rows=1
width=0) (actual time=306.997..306.997 rows=0 loops=1)
Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2
= 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
Planning Time: 0.106 ms
* Execution Time: 307.030 ms*
(9 rows)

*With Index 2 *

QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo.bar (cost=2647060.00..2647061.03 rows=1 width=69)
(actual time=420.881..420.881 rows=0 loops=1)
Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 =
573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
-> Bitmap Index Scan on idx_bloom_bar (cost=0.00..2647060.00 rows=1
width=0) (actual time=420.878..420.878 rows=0 loops=1)
Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2
= 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 =
326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
Planning Time: 0.104 ms
* Execution Time: 420.913 ms*
(9 rows)

Thanks,
Avinash.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Arthur Zakirov 2019-06-08 04:16:49 Re: [PROPOSAL] Drop orphan temp tables in single-mode
Previous Message Michael Paquier 2019-06-08 01:45:40 Re: Temp table handling after anti-wraparound shutdown (Was: BUG #15840)