WIP: BRIN bloom indexes

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: WIP: BRIN bloom indexes
Date: 2017-10-19 20:15:32
Message-ID: 5d78b774-7e9c-c94e-12cf-fef51cc89b1a@2ndquadrant.com
Lists: pgsql-hackers

Hi,

The BRIN minmax opclasses work well only for data where the column value is
somewhat correlated to the physical location in the table. So it works great
for timestamps in append-only log tables, for example. When that is not
the case (non-correlated columns), the minmax ranges get very "wide" and
we end up scanning large portions of the table.

Typical examples are columns with types that are inherently random (or
rather non-correlated), such as UUIDs, IP or MAC addresses, and so on. But
it can just as easily happen with simple IDs generated from a sequence.

This WIP patch implements another type of BRIN index, with the "summary"
represented by a bloom filter; instead of building a [min,max] range for
each page range, it builds a bloom filter of the values in that range. The
index is of course larger than the regular "minmax" BRIN index, but it's
still orders of magnitude smaller than a btree index.

Note: This is different from the index type implemented in the "bloom"
extension. Those are not related to BRIN at all, and the index builds a
bloom filter on individual rows (values in different columns).

BTW kudos to everyone who worked on the BRIN infrastructure and API. I
found it very intuitive and well-documented. Adding the new opclass was
extremely straightforward.

To demonstrate this on a small example, consider this table:

CREATE TABLE bloom_test (id uuid, padding text);

INSERT INTO bloom_test
SELECT md5((mod(i,1000000)/100)::text)::uuid, md5(i::text)
FROM generate_series(1,200000000) s(i);

VACUUM ANALYZE bloom_test;

List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------+-------+-------+-------+-------------
public | bloom_test | table | tomas | 16 GB |
(1 row)

The table was built so that each range contains a relatively small number
of distinct UUID values - this is typical e.g. for user activity with
"bursts" of actions from a particular user separated by long periods of
inactivity (with actions from other users in between).
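
A quick way to eyeball that distribution (not part of the original example,
just an illustration - it does a full-table scan, and the 128 matches the
default pages_per_range) is to count distinct values per 128-page block range:

select (ctid::text::point)[0]::bigint / 128 as range_no,
       count(distinct id) as distinct_ids
from bloom_test
group by 1
order by 2 desc
limit 10;

With the data generated above, each range should show only on the order of a
hundred distinct UUIDs.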

Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the id
column.

create index test_brin_idx on bloom_test
using brin (id);

create index test_bloom_idx on bloom_test
using brin (id uuid_bloom_ops);

create index test_btree_idx on bloom_test (id);

which gives us this:

List of relations
Schema | Name | Type | Owner | Table | Size
--------+----------------+-------+-------+------------+---------
public | test_bloom_idx | index | tomas | bloom_test | 12 MB
public | test_brin_idx | index | tomas | bloom_test | 832 kB
public | test_btree_idx | index | tomas | bloom_test | 6016 MB
(3 rows)

So yeah - the "bloom" index is larger than the default "minmax" index,
but it's still orders of magnitude smaller than the regular btree one.

Now, what about query performance? Unlike the "minmax" opclasses, the
"bloom" opclass can only handle equality conditions.

Let's see a query like this:

select * from bloom_test
where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';

The minmax index produces this plan

QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on bloom_test
(cost=390.31..2130910.82 rows=20593 width=49)
(actual time=197.974..22707.311 rows=20000 loops=1)
Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Rows Removed by Index Recheck: 199980000
Heap Blocks: lossy=2061856
-> Bitmap Index Scan on test_brin_idx
(cost=0.00..385.16 rows=5493161 width=0)
(actual time=133.463..133.463 rows=20619520 loops=1)
Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Planning time: 0.165 ms
Execution time: 22707.891 ms
(8 rows)

Which is not that great, and the long duration is a direct consequence
of "wide" ranges - the bitmap heap scan had to scan pretty much the
whole table. (A sequential scan takes only about 15 seconds.)

Now, the bloom index:

QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on bloom_test
(cost=5898.31..2136418.82 rows=20593 width=49)
(actual time=24.229..338.324 rows=20000 loops=1)
Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Rows Removed by Index Recheck: 2500448
Heap Blocks: lossy=25984
-> Bitmap Index Scan on test_bloom_idx
(cost=0.00..5893.16 rows=5493161 width=0)
(actual time=22.811..22.811 rows=259840 loops=1)
Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
Planning time: 0.201 ms
Execution time: 338.946 ms
(8 rows)

So, way better.

For comparison, a simple index scan / bitmap index scan using the btree
takes only about 10 ms in this case, but that's mostly thanks to the
whole dataset being entirely in memory.

There are some remaining open questions.

1) sizing the bloom filter

Firstly, the size of the filter is currently static, based on 1000
distinct values per range with a 5% false-positive rate. Those are mostly
arbitrary values, and I don't have any clear idea how to determine the
optimal ones.
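
For reference, plugging those defaults into the usual optimal bloom filter
formulas (standard math, nothing specific to this patch), with n = 1000 and
p = 0.05:

m = -n * ln(p) / (ln 2)^2 ~= 6235 bits (~780 bytes per range summary)
k = (m / n) * ln(2)       ~= 4.3      (so about 4 hash functions)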

Maybe we could do the sizing based on the ndistinct estimate for the indexed
column? It also depends on the pages_per_range value, so perhaps the filter
size should be a user-defined option too.
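
A sketch of what user-visible knobs might look like (pages_per_range is an
existing BRIN storage parameter; the false_positive_rate name is purely
hypothetical, not necessarily what the patch implements):

create index test_bloom_idx on bloom_test
using brin (id uuid_bloom_ops)
with (pages_per_range = 32,
      false_positive_rate = 0.05);   -- hypothetical option name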

An interesting feature is that the bloom indexes "degrade locally". If a
page range has significantly more distinct values, the bloom filter may
be way too small and will suffer from a high false-positive rate. But that
only affects that one page range; the rest may be perfectly fine.

I was thinking about disabling the bloom filter when it gets "too full"
(too many bits set, resulting in a high false-positive rate). But that
seems like a bad idea - the false-positive rate then automatically jumps to
100% for that range, and we only save a much smaller amount of space in
the index. So even if the false-positive rate is 50%, it still seems
efficient to keep the bloom filter.
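
To put rough numbers on that trade-off (standard bloom filter math again,
assuming the ~4 hash functions from the sizing above): with a fraction f of
the bits set, the false-positive rate is roughly f^k, so

f = 0.50  ->  ~6% false positives
f = 0.84  ->  ~50% false positives
f = 1.00  ->  100% false positives (i.e. dropping the filter entirely)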

Another thing to consider is what happens when there are very few
distinct values in a given page range. The current code tries to be a
bit smart in this case - instead of building the bloom filter right
away, it initially keeps the exact values and only switches to a bloom
filter once the exact values would fill the same amount of space.
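
As a rough illustration of where that switch happens (using the ~780-byte
filter from the sizing math above, which may not match the patch exactly),
16-byte uuid values give

780 bytes / 16 bytes per value ~= 48 exact values

so a range with fewer distinct values than that can be summarized exactly,
with no false positives at all.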

2) costing

The other thing is the costing of BRIN indexes. At this point it's rather
simplistic and independent of the operator class, so the only thing that
matters is the size of the BRIN index. That means a "minmax" index is always
considered cheaper than a "bloom" index, because the optimizer has no idea
that "minmax" indexes are more vulnerable to "wide" ranges.

But perhaps this is a non-issue, as bloom indexes are meant for cases
where minmax indexes don't work. And when minmax indexes do work, they work
better than bloom, so people are unlikely to build both of them at the
same time.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
brin-bloom-v1.patch text/x-patch 61.7 KB
