Re: WIP: BRIN multi-range indexes

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: BRIN multi-range indexes
Date: 2020-08-07 16:27:01
Message-ID: 20200807162701.v7lyzun32lqku3md@development
Lists: pgsql-hackers

On Tue, Aug 04, 2020 at 05:17:43PM +0200, Tomas Vondra wrote:
>On Tue, Aug 04, 2020 at 05:36:51PM +0300, Alexander Korotkov wrote:
>>Hi, Tomas!
>>
>>Sorry for the late reply.
>>
>>On Sun, Jul 19, 2020 at 6:19 PM Tomas Vondra
>><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>I think there's a number of weak points in this approach.
>>>
>>>Firstly, it assumes the summaries can be represented as arrays of
>>>built-in types, which I'm not really sure about. It clearly is not true
>>>for the bloom opclasses, for example. But even for minmax opclasses it's
>>>going to be tricky because the ranges may be of different data types, so
>>>presumably we'd need a somewhat nested data structure.
>>>
>>>Moreover, multi-minmax summary contains either points or intervals,
>>>which requires additional fields/flags to indicate that. That further
>>>complicates things ...
>>>
>>>Maybe we could decompose that into separate arrays or something, but
>>>honestly it seems somewhat premature - there are far more important
>>>aspects to discuss, I think (e.g. how the ranges are built/merged in
>>>multi-minmax, or whether bloom opclasses are useful at all).
>>
>>I see. But there is at least a second option: introduce a new
>>datatype with just an output function, similar to the way
>>gist/tsvector_ops uses the gtsvector key type. I think it would be
>>more transparent than using just bytea. Also, this is an approach we
>>already use in core.
>>
>
>So you're proposing to have new data types "brin_minmax_multi_summary"
>and "brin_bloom_summary" (or some other names), with output functions
>printing something nicer? I suppose that could work, and we could even
>add pageinspect functions returning the value as raw bytea.
>
>Good idea!
>

Attached is an updated version of the patch series, implementing this.
Adding the extra data types was fairly simple, because both bloom and
minmax-multi indexes already used a "struct as varlena" approach, so all
that was needed was a bunch of in/out functions and catalog records.
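
For the record, the SQL-level equivalent of what those catalog records
do looks roughly like this. This is only a demo - the patches add the
catalog entries directly, the type name is made up, and I'm reusing
byteain/byteaout just to make it runnable (the real output functions
print a human-readable summary instead):

create type pg_brin_bloom_summary_demo;

-- I/O functions for the shell type (requires superuser)
create function pg_brin_bloom_summary_demo_in(cstring)
    returns pg_brin_bloom_summary_demo
    language internal immutable strict as 'byteain';

create function pg_brin_bloom_summary_demo_out(pg_brin_bloom_summary_demo)
    returns cstring
    language internal immutable strict as 'byteaout';

-- fill in the shell type: a plain varlena, like the real summary types
create type pg_brin_bloom_summary_demo (
    input = pg_brin_bloom_summary_demo_in,
    output = pg_brin_bloom_summary_demo_out,
    internallength = variable,
    storage = extended
);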

I've left the changes in separate patches for clarity; ultimately
they'll get merged into the other parts.

This reminded me that the current costing may not quite work, because
it depends on how well the index is correlated to the table. That may
be OK for minmax-multi in most cases, but for bloom it makes almost no
sense - correlation does not really matter for bloom filters; what
matters is the number of distinct values in each range.

Consider this example:

create table t (a int);

insert into t select x from (
    select (i/10) as x from generate_series(1,10000000) s(i)
    order by random()
) foo;

create index on t using brin(
    a int4_bloom_ops(n_distinct_per_range=6000,
                     false_positive_rate=0.05))
with (pages_per_range = 16);

vacuum analyze t;

test=# explain analyze select * from t where a = 10000;
                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..169247.71 rows=10 width=4) (actual time=38.088..513.654 rows=10 loops=1)
   Filter: (a = 10000)
   Rows Removed by Filter: 9999990
 Planning Time: 0.060 ms
 Execution Time: 513.719 ms
(5 rows)

test=# set enable_seqscan = off;
SET
test=# explain analyze select * from t where a = 10000;
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=5553.07..174800.78 rows=10 width=4) (actual time=7.790..27.585 rows=10 loops=1)
   Recheck Cond: (a = 10000)
   Rows Removed by Index Recheck: 224182
   Heap Blocks: lossy=992
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..5553.06 rows=9999977 width=0) (actual time=7.006..7.007 rows=9920 loops=1)
         Index Cond: (a = 10000)
 Planning Time: 0.052 ms
 Execution Time: 27.658 ms
(8 rows)

Clearly, the main problem is that brincostestimate relies on correlation
to tweak the selectivity estimates, leading to an estimate of almost the
whole table when in practice we only scan a tiny fraction of it.

Part 0008 is an experimental patch that tweaks the logic to ignore
correlation for bloom and minmax-multi opclasses, producing this plan:

test=# explain analyze select * from t where a = 10000;
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=5542.01..16562.95 rows=10 width=4) (actual time=12.013..34.705 rows=10 loops=1)
   Recheck Cond: (a = 10000)
   Rows Removed by Index Recheck: 224182
   Heap Blocks: lossy=992
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..5542.00 rows=3615 width=0) (actual time=11.108..11.109 rows=9920 loops=1)
         Index Cond: (a = 10000)
 Planning Time: 0.386 ms
 Execution Time: 34.778 ms
(8 rows)

which is way closer to reality, of course. I'm not entirely sure it
behaves correctly for multi-column BRIN indexes, but I think as a PoC
it's sufficient.

For bloom, I think we can be a bit smarter - we could use the false
positive rate as the "minimum expected selectivity" or something like
that. After all, the false positive rate essentially answers "Given a
random value, what's the chance that a bloom filter matches?" So given a
table with N ranges, we expect about (N * fpr) ranges to match. Of
course, the problem is that this only works for "full" bloom filters.
Ranges with fewer distinct values will have a much lower match
probability, and ranges with unexpectedly many distinct values will have
a much higher one.

But I think we can ignore that and assume the index was created with
good parameters, so the bloom filters won't degrade and the target fpr
is probably a defensive value.
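
Just as a sanity check, here is that arithmetic for the example table
above as a hypothetical SQL one-liner (8192 is the default block size,
16 is the pages_per_range from the index definition, 0.05 the target
fpr; this is the expectation, not what the costing code does):

-- ranges = relation pages / pages_per_range, then multiply by the fpr
select n_ranges, n_ranges * 0.05 as expected_matches
from (select ceil(pg_relation_size('t') / 8192.0 / 16) as n_ranges) s;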

For minmax-multi, we probably should not ignore correlation entirely.
It does handle imperfect correlation much more gracefully than plain
minmax, but it still depends on reasonably ordered data.

A possible improvement would be to compute the average "coverage" of
ranges, i.e. given the length of the column domain

    D = MAX(column) - MIN(column)

compute what fraction of it is covered by a range, by summing the
lengths of the intervals in the range and dividing that by D, and then
average this over all BRIN ranges.
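
A hypothetical SQL sketch of that computation, assuming the per-range
intervals got extracted into a table brin_intervals(range_id, lo, hi)
somehow (the table name and the extraction step are made up):

with domain as (
    -- length of the column domain, cast to avoid integer division
    select (max(a) - min(a))::numeric as d from t
),
per_range as (
    -- total length covered by the intervals of each BRIN range
    select range_id, sum(hi - lo) as covered
    from brin_intervals
    group by range_id
)
select avg(covered / d) as avg_coverage
from per_range, domain;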

This would allow us to estimate how many ranges are matched by a random
value from the column domain, I think. But this requires extending the
data ANALYZE collects for indexes - I don't think any BRIN-specific
stats are collected at the moment.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-Pass-all-keys-to-BRIN-consistent-function-a-20200807.patch text/plain 20.8 KB
0002-Move-IS-NOT-NULL-checks-to-bringetbitmap-20200807.patch text/plain 9.9 KB
0003-Move-processing-of-NULLs-from-BRIN-support--20200807.patch text/plain 16.1 KB
0004-BRIN-bloom-indexes-20200807.patch text/plain 126.7 KB
0005-add-special-pg_brin_bloom_summary-data-type-20200807.patch text/plain 5.8 KB
0006-BRIN-multi-range-minmax-indexes-20200807.patch text/plain 198.0 KB
0007-add-special-pg_brin_minmax_multi_summary-da-20200807.patch text/plain 15.3 KB
0008-tweak-costing-for-bloom-minmax-multi-indexe-20200807.patch text/plain 3.9 KB
