Re: WIP: BRIN bloom indexes

From: Sokolov Yura <y(dot)sokolov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: BRIN bloom indexes
Date: 2017-10-27 15:22:15
Message-ID: 94c173b54a0aef6ae9b18157ef52f03e@postgrespro.ru
Lists: pgsql-hackers

On 2017-10-19 23:15, Tomas Vondra wrote:
> Hi,
>
> The BRIN minmax opclasses work well only for data where the column is
> somewhat correlated to physical location in a table. So it works great
> for timestamps in append-only log tables, for example. When that is not
> the case (non-correlated columns) the minmax ranges get very "wide" and
> we end up scanning large portions of the tables.
>
> Typical examples are columns with types that are inherently random (or
> rather non-correlated), like UUIDs, IP or MAC addresses, and
> so on. But it can just as easily happen even with simple IDs generated
> from a sequence.
>
> This WIP patch implements another type of BRIN index, where the
> "summary" is represented by a bloom filter instead of a [min,max]
> range for each page range. The index is of course larger than the
> regular "minmax" BRIN index, but it's still orders of magnitude smaller
> than a btree index.
>
> Note: This is different from the index type implemented in the "bloom"
> extension. That extension is not related to BRIN at all; its index
> builds a bloom filter on individual rows (combining values from
> different columns).
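
To make the idea concrete, here is a minimal standalone sketch (hypothetical
names and sizes, not the patch's code and not the actual BRIN opclass API):
each page range keeps a small bit array, summarizing the range sets a few
hashed bits per value, and at query time a range can be skipped whenever any
probed bit is clear.

#include <stdbool.h>
#include <stdint.h>

#define NBITS   8192        /* bits per page-range filter (illustrative) */
#define NHASHES 4           /* hash probes per value (illustrative) */

typedef struct RangeSummary
{
    uint8_t bits[NBITS / 8];        /* bloom filter for one page range */
} RangeSummary;

/* stand-in hash; the patch uses PostgreSQL's hash_uint32_extended() */
static uint32_t
hash_value(uint32_t value, uint32_t seed)
{
    uint32_t h = value ^ (seed * 0x9e3779b9u);

    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* called for every heap tuple while summarizing a page range */
static void
summary_add(RangeSummary *s, uint32_t value)
{
    for (uint32_t i = 0; i < NHASHES; i++)
    {
        uint32_t h = hash_value(value, i) % NBITS;

        s->bits[h / 8] |= (uint8_t) (1 << (h % 8));
    }
}

/* true: range might contain the value (scan it); false: skip the range */
static bool
summary_might_contain(const RangeSummary *s, uint32_t value)
{
    for (uint32_t i = 0; i < NHASHES; i++)
    {
        uint32_t h = hash_value(value, i) % NBITS;

        if ((s->bits[h / 8] & (1 << (h % 8))) == 0)
            return false;
    }
    return true;
}

A "false" here guarantees the value is absent from the range, while "true"
only means "maybe there" - exactly the lossy behaviour BRIN needs for its
bitmap recheck.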
>
> BTW kudos to everyone who worked on the BRIN infrastructure and API. I
> found it very intuitive and well-documented. Adding the new opclass was
> extremely straightforward.
>
>
> To demonstrate this on a small example, consider this table:
>
> CREATE TABLE bloom_test (id uuid, padding text);
>
> INSERT INTO bloom_test
> SELECT md5((mod(i,1000000)/100)::text)::uuid, md5(i::text)
> FROM generate_series(1,200000000) s(i);
>
> VACUUM ANALYZE bloom_test;
>
>                     List of relations
>  Schema |    Name    | Type  | Owner | Size  | Description
> --------+------------+-------+-------+-------+-------------
>  public | bloom_test | table | tomas | 16 GB |
> (1 row)
>
> The table was built so that each range contains a relatively small number
> of distinct UUID values - this is typical e.g. for user activity with
> "bursts" of actions from a particular user separated by long periods of
> inactivity (with actions from other users).
>
> Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the
> id column.
>
> create index test_brin_idx on bloom_test
> using brin (id);
>
> create index test_bloom_idx on bloom_test
> using brin (id uuid_bloom_ops);
>
> create index test_btree_idx on bloom_test (id);
>
> which gives us this:
>
>                       List of relations
>  Schema |      Name      | Type  | Owner |   Table    |  Size
> --------+----------------+-------+-------+------------+---------
>  public | test_bloom_idx | index | tomas | bloom_test | 12 MB
>  public | test_brin_idx  | index | tomas | bloom_test | 832 kB
>  public | test_btree_idx | index | tomas | bloom_test | 6016 MB
> (3 rows)
>
> So yeah - the "bloom" index is larger than the default "minmax" index,
> but it's still orders of magnitude smaller than the regular btree one.
>
> Now, what about query performance? Unlike the "minmax" indexes, the
> "bloom" filter can only handle equality conditions.
>
> Let's see a query like this:
>
> select * from bloom_test
> where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>
> The minmax index produces this plan
>
>                                 QUERY PLAN
> ------------------------------------------------------------------------
>  Bitmap Heap Scan on bloom_test
>        (cost=390.31..2130910.82 rows=20593 width=49)
>        (actual time=197.974..22707.311 rows=20000 loops=1)
>    Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>    Rows Removed by Index Recheck: 199980000
>    Heap Blocks: lossy=2061856
>    ->  Bitmap Index Scan on test_brin_idx
>          (cost=0.00..385.16 rows=5493161 width=0)
>          (actual time=133.463..133.463 rows=20619520 loops=1)
>          Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>  Planning time: 0.165 ms
>  Execution time: 22707.891 ms
> (8 rows)
>
> Which is not that great, and the long duration is a direct consequence
> of "wide" ranges - the bitmap heap scan had to scan pretty much the
> whole table. (A sequential scan takes only about 15 seconds.)
>
> Now, the bloom index:
>
>                                 QUERY PLAN
> ------------------------------------------------------------------------
>  Bitmap Heap Scan on bloom_test
>        (cost=5898.31..2136418.82 rows=20593 width=49)
>        (actual time=24.229..338.324 rows=20000 loops=1)
>    Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>    Rows Removed by Index Recheck: 2500448
>    Heap Blocks: lossy=25984
>    ->  Bitmap Index Scan on test_bloom_idx
>          (cost=0.00..5893.16 rows=5493161 width=0)
>          (actual time=22.811..22.811 rows=259840 loops=1)
>          Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>  Planning time: 0.201 ms
>  Execution time: 338.946 ms
> (8 rows)
>
> So, way better.
>
> For comparison, a simple index scan / bitmap index scan using the btree
> index takes only about 10 ms in this case, but that's mostly thanks to the
> whole dataset being entirely in-memory.
>
> There are some remaining open questions.
>
> 1) sizing the bloom filter
>
> Firstly, the size of the filter is currently static, based on 1000
> distinct values per range, with a 5% false-positive rate. Those are
> mostly arbitrary values, and I don't have any clear idea how to
> determine optimal values.
>
> Maybe we could do the sizing based on ndistinct value for the indexed
> column? It also depends on the pages_per_range value, so perhaps it
> should be a user-defined option too.
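
For reference, the textbook bloom filter sizing math for those defaults
(just the standard formulas, not necessarily what the patch computes): for
n expected distinct values and a target false-positive rate p, the optimal
filter uses m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash
functions, which for n = 1000 and p = 0.05 comes out to roughly 6200 bits
(under 800 bytes) per page range and 4 hashes:

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double n = 1000;    /* expected distinct values per page range */
    double p = 0.05;    /* target false-positive rate */

    /* standard bloom filter sizing formulas */
    double m = ceil(-n * log(p) / (log(2) * log(2)));   /* bits */
    int    k = (int) round((m / n) * log(2));           /* hash functions */

    printf("m = %.0f bits (~%.0f bytes), k = %d hash functions\n",
           m, ceil(m / 8), k);
    return 0;
}

Basing n on the column's ndistinct estimate scaled by pages_per_range, as
suggested above, could plug straight into these formulas.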
>
> An interesting feature is that the bloom indexes "degrade locally". If a
> page range has significantly more distinct values, the bloom filter may
> be way too small and will suffer from a high false-positive rate. But it
> only affects that one page range, and the rest may be perfectly fine.
>
> I was thinking about disabling the bloom filter when it gets "too full"
> (too many bits set, resulting in a high false-positive rate). But that
> seems like a bad idea - the false-positive rate automatically jumps to
> 100% for that range, and we only save a much smaller amount of space in
> the index. So even if the false-positive rate is 50% it still seems
> efficient to keep the bloom filter.
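
A quick back-of-the-envelope check supports keeping an overfull filter (a
standalone illustration, assuming k = 4 hashes as in the sizing example
above): with a fraction f of the bits set, a lookup is a false positive only
when all k probed bits happen to be set, i.e. roughly f^k.

#include <stdio.h>
#include <math.h>

int
main(void)
{
    int k = 4;      /* hash probes per value (as in the sizing example) */

    /* false-positive rate is roughly f^k when a fraction f of bits is set */
    for (int pct = 50; pct <= 90; pct += 10)
    {
        double f = pct / 100.0;

        printf("%d%% of bits set -> ~%.0f%% false positives\n",
               pct, pow(f, k) * 100);
    }
    return 0;
}

Even with 90% of the bits set this gives roughly 66% false positives, so
about a third of the non-matching ranges are still skipped - far better than
the guaranteed 100% false positives of dropping the filter entirely.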
>
> Another thing to consider is what happens when there are very few
> distinct values in a given page range. The current code tries to be a
> bit smart in this case - instead of building the bloom filter right
> away, it initially keeps the exact values and only switches to a bloom
> filter once the exact values would fill the same space.
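
Continuing the earlier standalone sketch (same caveats: hypothetical names,
not the patch's actual on-disk representation), that switch could look
roughly like this, reusing summary_add() and RangeSummary from above:

#define MAX_EXACT 128   /* exact values kept before switching (illustrative) */

typedef struct ExactOrBloom
{
    bool         is_bloom;          /* false: exact values, true: filter */
    int          nvalues;           /* number of exact values stored */
    uint32_t     values[MAX_EXACT]; /* exact values, while they still fit */
    RangeSummary bloom;             /* bloom filter, once switched */
} ExactOrBloom;

static void
exact_or_bloom_add(ExactOrBloom *s, uint32_t value)
{
    if (!s->is_bloom)
    {
        /* in exact mode, store each distinct value only once */
        for (int i = 0; i < s->nvalues; i++)
        {
            if (s->values[i] == value)
                return;
        }

        if (s->nvalues < MAX_EXACT)
        {
            s->values[s->nvalues++] = value;
            return;
        }

        /* exact storage is full: fold everything into the bloom filter */
        s->is_bloom = true;
        for (int i = 0; i < s->nvalues; i++)
            summary_add(&s->bloom, s->values[i]);
    }

    summary_add(&s->bloom, value);
}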
>
> 2) costing
>
> The other thing is costing of BRIN indexes. At this point it's rather
> simplistic and independent of the operator class, so the only thing that
> matters is the size of the BRIN index. That means a "minmax" index is
> always considered cheaper than a "bloom" index, because the optimizer
> has no idea that "minmax" indexes are more vulnerable to "wide ranges".
>
> But perhaps this is a non-issue, as bloom indexes are meant for cases
> when minmax indexes don't work. And when minmax indexes work, they work
> better than bloom. So people are unlikely to build both of them at the
> same time.
>
>
> regards

Hi, Tomas

The BRIN bloom index is a really cool feature that definitely should be in
the core distribution (either in contrib or built-in)!!!

A small suggestion for the algorithm:

It is a well-known practice not to calculate a separate hash function for
every probe, but to use the double hashing approach instead.
A paper proving that double hashing works for bloom filters:
https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

So, instead of:

for (i = 0; i < filter->nhashes; i++)
{
    /* compute the hash with a seed, used for the bloom filter */
    uint32  h = ((uint32) DatumGetInt64(hash_uint32_extended(value, i))) %
                filter->nbits;

    /* set or check bit h */
}

such a construction could be used:

/* compute one hash with a fixed seed, used for the bloom filter */
uint32  big_h = (uint32) DatumGetInt64(hash_uint32_extended(value, 0));
uint32  h = big_h % filter->nbits;
/* modulo by (nbits - 1) ensures d is always < filter->nbits - 1 */
uint32  d = big_h % (filter->nbits - 1);

for (i = 0; i < filter->nhashes; i++)
{
    /* set or check bit h */

    /* compute the next bit h */
    h += d++;
    if (h >= filter->nbits)
        h -= filter->nbits;
    if (d == filter->nbits)
        d = 0;
}

Taking one 64-bit hash result modulo two coprime numbers (`nbits` and
`nbits - 1` are consecutive integers, hence always coprime) gives two
good-quality values usable for double hashing.
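
For completeness, here is a small standalone program showing the whole probe
sequence coming out of a single hash (a sketch with a stand-in 64-bit mixer
instead of PostgreSQL's hash_uint32_extended(), so it compiles on its own):

#include <stdio.h>
#include <stdint.h>

#define NBITS   8192    /* illustrative filter size */
#define NHASHES 4       /* illustrative number of probes */

/* stand-in for hash_uint32_extended(): any decent 64-bit mixer works here */
static uint64_t
mix64(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

int
main(void)
{
    uint64_t big_h = mix64(12345);                  /* one hash per value */
    uint32_t h = (uint32_t) (big_h % NBITS);        /* first probe position */
    uint32_t d = (uint32_t) (big_h % (NBITS - 1));  /* step, < NBITS - 1 */

    for (int i = 0; i < NHASHES; i++)
    {
        printf("probe %d: bit %u\n", i, h);

        /* advance to the next probe position, as in the loop above */
        h += d++;
        if (h >= NBITS)
            h -= NBITS;
        if (d == NBITS)
            d = 0;
    }
    return 0;
}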

With regards,
--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
