Re: WIP: BRIN multi-range indexes

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Mark Dilger <hornschnorter(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: WIP: BRIN multi-range indexes
Date: 2018-06-24 02:01:47
Message-ID: 459eef3e-48c7-0f5a-8356-992442a78bb6@2ndquadrant.com
Lists: pgsql-hackers

Hi,

Attached is a rebased version of this BRIN patch series, mostly fixing
the breakage due to 372728b0 (aka the initial-catalog-data format
changes). As the 2018-07 CF is meant for almost-ready patches, this is
more 2018-09 material. But perhaps someone would like to take a look -
I'd have to fix it anyway ...

At the PGCon dev meeting I suggested that long-running patches should
have a "summary" post once in a while, so that reviewers don't have to
re-read the whole thread and follow all the various discussions. So let
me start with this thread, although it's not a particularly long or
complex one, nor does it have a long discussion. But anyway ...

The patches introduce two new BRIN opclasses - minmax-multi and bloom.

minmax-multi
============

minmax-multi is a variant of the current minmax opclass that handles
cases where the plain minmax opclass degrades due to outlier values.

Imagine almost perfectly correlated data (say, timestamps in a log
table) - that works great with regular minmax indexes. But if you go
and delete a bunch of historical messages (for whatever reason), new
rows with new timestamps will be routed into the freed space, and the
minmax indexes will degrade because the ranges get much "wider" due to
the new values.

The minmax-multi indexes deal with that by maintaining not a single
minmax range, but several of them. That allows tracking the outlier
values separately, without constructing one wide minmax range.
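
For illustration (made-up numbers, not actual index contents): where
plain minmax would summarize such a page range as one wide interval
like [1, 100000000], minmax-multi might store something like
{[1, 1], [10000, 20000], [100000000, 100000000]} - keeping the two
outliers as separate points while the bulk of the data retains a tight
range.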

Consider this artificial example:

create table t (a bigint, b int);

-- leave some free space on each page for the updates below
alter table t set (fillfactor=95);

insert into t select i + 1000*random(), i + 1000*random()
  from generate_series(1,100000000) s(i);

-- turn ~0.1% of the rows into extreme outliers
update t set a = 1, b = 1 where random() < 0.001;
update t set a = 100000000, b = 100000000 where random() < 0.001;

Now if you create a regular minmax index, it's going to perform
terribly, because pretty much every minmax range is [1,100000000] thanks
to the update of 0.1% of rows.

create index on t using brin (a);

explain analyze select * from t
where a between 1923300::int and 1923600::int;

QUERY PLAN
-----------------------------------------------------------------
Bitmap Heap Scan on t (cost=75.11..75884.45 rows=319 width=12)
(actual time=948.906..101739.892 rows=308 loops=1)
Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
Rows Removed by Index Recheck: 99999692
Heap Blocks: lossy=568182
-> Bitmap Index Scan on t_a_idx (cost=0.00..75.03 rows=22587
width=0) (actual time=89.357..89.357 rows=5681920 loops=1)
Index Cond: ((a >= 1923300) AND (a <= 1923600))
Planning Time: 2.161 ms
Execution Time: 101740.776 ms
(8 rows)

But with the minmax-multi opclass, this is not an issue:

create index on t using brin (a int8_minmax_multi_ops);

QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on t (cost=1067.11..76876.45 rows=319 width=12)
(actual time=38.906..49.763 rows=308 loops=1)
Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
Rows Removed by Index Recheck: 22220
Heap Blocks: lossy=128
-> Bitmap Index Scan on t_a_idx (cost=0.00..1067.03 rows=22587
width=0) (actual time=28.069..28.069 rows=1280 loops=1)
Index Cond: ((a >= 1923300) AND (a <= 1923600))
Planning Time: 1.715 ms
Execution Time: 50.866 ms
(8 rows)

Which is clearly a big improvement.

Doing this required some changes to how BRIN evaluates conditions on
page ranges. With a single minmax range it was enough to evaluate them
one by one, but minmax-multi needs to see all of them at once (to match
them against the partial ranges).

Most of the complexity is in building the summary, particularly in
picking which values (partial ranges) to merge. The maximum number of
values in the summary is specified by the values_per_range index
reloption, which defaults to 64 - so a summary can hold 64 points, or
32 intervals, or some combination of the two.
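
For example, to trade some precision for smaller summaries, the
reloption could be lowered (a usage sketch - this assumes
values_per_range is set with the usual WITH clause, like other BRIN
reloptions):

create index on t using brin (a int8_minmax_multi_ops)
  with (values_per_range = 32);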

I've been thinking about some automated way to tune this (either
globally or for each page range independently), but so far I have not
been very successful. The challenge is that making good decisions
requires global information about values in the column (e.g. global
minimum and maximum). I think the reloption with 64 as a default is a
good enough solution for now.

Perhaps the stats from pg_statistic would be useful for improving this
in the future, but I'm not sure.

bloom
=====

As the name suggests, this opclass uses a bloom filter for the summary.
Compared to minmax-multi it's a bit more of an experimental idea, but I
believe the foundations are safe.

Using a bloom filter means the index can only support equality
conditions, but for many use cases that's an acceptable limitation -
UUIDs, IP addresses, ... (various identifiers in general).
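
A usage sketch (the opclass name below merely follows the minmax-multi
naming pattern - it's an assumption, not something confirmed in this
mail):

create table sessions (id uuid, started_at timestamptz);

create index on sessions using brin (id uuid_bloom_ops);

-- only equality conditions can use the index
select * from sessions
 where id = 'a81bc81b-dead-4e5d-abff-90865d1e13b1';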

Of course, the question is how to size the bloom filter. It's worth
noting that the false positive rate of the filter is essentially the
fraction of the table that will be scanned every time.
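
For example, with a 1% false positive rate on the 100M-row table above,
an equality lookup would be expected to needlessly scan page ranges
covering roughly 1M rows, on top of the ranges that genuinely match.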

Similarly to minmax-multi, the parameters used to compute the optimal
filter size are set as reloptions (false_positive_rate,
n_distinct_per_range), with reasonable defaults (1% false positive
rate, and number of distinct values equal to 10% of the maximum number
of heap tuples in a page range).
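
Again assuming these are set like other BRIN reloptions, a sketch with
made-up numbers might look like this:

create index on sessions using brin (id uuid_bloom_ops)
  with (false_positive_rate = 0.01, n_distinct_per_range = 1000);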

Note: When building the filter, we don't compute the hashes from the
original values directly - we first apply the type-specific hash
function (the same one we'd use for hash indexes or hash joins) and
then use that hash as the input for the bloom filter. This generally
works fine, but if "our" hash function generates a lot of collisions,
it increases the false positive rate of the whole filter. I'm not
aware of a case where this would be an issue, though.

What further complicates sizing of the bloom filter is the available
space - the whole filter needs to fit onto an 8kB page, and "full"
bloom filters with about 1/2 the bits set are pretty much
incompressible, so there are maybe ~8000 bytes available for the
bitmap. For columns with many distinct values it may therefore be
necessary to make the page range smaller, to reduce the number of
distinct values in it.
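
As a rough sanity check, the textbook bloom filter sizing formula
m = -n * ln(p) / (ln 2)^2 can be inverted: with m = 64000 bits (8000
bytes) and p = 0.01, we get n = m * (ln 2)^2 / ln(1/p), i.e. roughly
6700 distinct values per page range while still meeting the 1% target
rate.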

And of course this requires good ndistinct estimates, not just for the
column as a whole but for a single page range (because that's what
matters for sizing the bloom filter) - and that is not a particularly
reliable estimate, I'm afraid.
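
There's no per-range ndistinct statistic to consult, but the
column-wide estimate at least gives a starting point for picking the
reloption:

select n_distinct
  from pg_stats
 where tablename = 't' and attname = 'a';
-- a negative result is a fraction of the row count, not an absolute
-- number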

So reloptions seem like a sufficient solution, at least for now.

open questions
==============

* I suspect the definitions of the cross-type opclasses (e.g. int2 vs.
int8) are not entirely correct. That probably needs another look.

* The bloom filter now works in two modes - sorted (storing the sorted
hashes directly) and hashed (the usual bloom filter behavior). The
idea is that for ranges with significantly fewer distinct values we
only store the hashes themselves to save space, instead of allocating
the whole bloom filter with mostly 0 bits.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz application/gzip 5.4 KB
0002-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz application/gzip 3.0 KB
0003-BRIN-bloom-indexes.patch.gz application/gzip 20.5 KB
0004-BRIN-multi-range-minmax-indexes.patch.gz application/gzip 29.6 KB
