Re: WIP: BRIN multi-range indexes

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: BRIN multi-range indexes
Date: 2021-01-22 01:05:55
Message-ID: d288a5f2-3225-64c4-2be0-62379a138a34@enterprisedb.com
Lists: pgsql-hackers

On 1/20/21 1:07 AM, Tomas Vondra wrote:
> On 1/19/21 9:44 PM, John Naylor wrote:
>> On Tue, Jan 12, 2021 at 1:42 PM Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com <mailto:tomas(dot)vondra(at)enterprisedb(dot)com>>
>> wrote:
>>  > I suspect it's due to minmax having to decide which "ranges" to merge,
>>  > which requires repeated sorting, etc. I certainly don't dare to claim
>>  > the current algorithm is perfect. I wouldn't have expected such big
>>  > difference, though - so definitely worth investigating.
>>
>> It seems that monotonically increasing (or decreasing) values in a
>> table are a worst case scenario for multi-minmax indexes, or
>> basically, unique values within a range. I'm guessing it's because it
>> requires many passes to fit all the values into a limited number of
>> ranges. I tried using smaller pages_per_range numbers, 32 and 8, and
>> that didn't help.
>>
>> Now, with a different data distribution, using only 10 values that
>> repeat over and over, the results are much more sympathetic to
>> multi-minmax:
>>
>> insert into iot (num, create_dt)
>> select random(), '2020-01-01 0:00'::timestamptz + (x % 10 || '
>> seconds')::interval
>> from generate_series(1,5*365*24*60*60) x;
>>
>> create index cd_single on iot using brin(create_dt);
>> 27.2s
>>
>> create index cd_multi on iot using brin(create_dt
>> timestamptz_minmax_multi_ops);
>> 30.4s
>>
>> create index cd_bt on iot using btree(create_dt);
>> 61.8s
>>
>> Circling back to the monotonic case, I tried running a simple perf
>> record on a backend creating a multi-minmax index on a timestamptz
>> column and these were the highest non-kernel calls:
>> +   21.98%    21.91%  postgres   postgres    [.] FunctionCall2Coll
>> +    9.31%     9.29%  postgres   postgres    [.] compare_combine_ranges
>> +    8.60%     8.58%  postgres   postgres    [.] qsort_arg
>> +    5.68%     5.66%  postgres   postgres    [.] brin_minmax_multi_add_value
>> +    5.63%     5.60%  postgres   postgres    [.] timestamp_lt
>> +    4.73%     4.71%  postgres   postgres    [.] reduce_combine_ranges
>> +    3.80%     0.00%  postgres   [unknown]   [.] 0x0320016800040000
>> +    3.51%     3.50%  postgres   postgres    [.] timestamp_eq
>>
>> There's no one place that's pathological enough to explain the 4x
>> slowness over traditional BRIN and nearly 3x slowness over btree when
>> using a large number of unique values per range, so making progress
>> here would have to involve a more holistic approach.
>>
>
> Yeah. This very much seems like the primary problem is in how we build
> the ranges incrementally - with monotonic sequences, we end up having to
> merge the ranges over and over again. I don't know what was the
> structure of the table, but I guess it was kinda narrow (very few
> columns), which exacerbates the problem further, because the number of
> rows per range will be way higher than in real-world.
>
> I do think the solution to this might be to allow more values during
> batch index creation, and only "compress" to the requested number at the
> very end (when serializing to on-disk format).
>
> There are a couple additional comments about possibly replacing
> sequential scan with a binary search, that could help a bit too.
>

OK, I took a look at this, and I came up with two optimizations that
improve things for the pathological cases. I've kept them as separate
patches on top of the previous patch version, to allow easier review of
the changes.

0007 - This reworks how the ranges are reduced by merging the closest
ranges. Instead of doing that iteratively in a fairly expensive loop,
the new reduce_combine_ranges() uses a much simpler approach.

There are a couple more optimizations (skipping expensive code when
it's not needed, etc.) which should help a bit too.
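
Purely for illustration (this is not the code from 0007, and the struct
and function names are made up), the general idea of collapsing the
closest neighboring ranges in a single pass, rather than repeatedly
searching for the closest pair, might look something like this,
simplified to plain doubles:

#include <stdlib.h>

/*
 * Sketch only: reduce sorted, non-overlapping ranges to target_count
 * by dropping the boundaries with the smallest gaps in a single pass.
 */
typedef struct Range
{
    double  min;
    double  max;
} Range;

typedef struct Gap
{
    int     index;  /* boundary between ranges[index] and ranges[index+1] */
    double  dist;   /* distance across that boundary */
} Gap;

static int
gap_cmp(const void *a, const void *b)
{
    double  da = ((const Gap *) a)->dist;
    double  db = ((const Gap *) b)->dist;

    return (da > db) - (da < db);
}

static int
int_cmp(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

static int
reduce_ranges(Range *ranges, int nranges, int target_count)
{
    int     nmerge = nranges - target_count;
    Gap    *gaps;
    int    *merge_at;
    int     nout = 0;

    if (nmerge <= 0)
        return nranges;

    /* compute the gap across every boundary between adjacent ranges */
    gaps = malloc(sizeof(Gap) * (nranges - 1));
    for (int i = 0; i < nranges - 1; i++)
    {
        gaps[i].index = i;
        gaps[i].dist = ranges[i + 1].min - ranges[i].max;
    }

    /* pick the nmerge smallest gaps, then process them left to right */
    qsort(gaps, nranges - 1, sizeof(Gap), gap_cmp);
    merge_at = malloc(sizeof(int) * nmerge);
    for (int i = 0; i < nmerge; i++)
        merge_at[i] = gaps[i].index;
    qsort(merge_at, nmerge, sizeof(int), int_cmp);

    /* single pass: extend the current output range across merged boundaries */
    ranges[nout] = ranges[0];
    for (int i = 0, m = 0; i < nranges - 1; i++)
    {
        if (m < nmerge && merge_at[m] == i)
        {
            /* this boundary goes away, absorb the next range */
            ranges[nout].max = ranges[i + 1].max;
            m++;
        }
        else
            ranges[++nout] = ranges[i + 1];
    }

    free(gaps);
    free(merge_at);

    return nout + 1;    /* equals target_count */
}

The point being that the distances are computed and sorted once, and
the merges then happen in one left-to-right pass, instead of re-scanning
the whole set after every merge.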

0008 - This is a WIP version of the batch mode. Originally, when
building an index we'd "fill" the small buffer, combine some of the
ranges to free ~25% of the space for new values, and then do this over
and over. This involves some expensive steps (sorting etc.), and for
some pathological cases (like monotonic sequences) it performed
particularly poorly. The new code simply collects all values in the
range and then does the expensive stuff only once.
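
As a rough sketch of that idea (again simplified, with made-up names,
and not the actual patch code): keep appending values to a growing
buffer, and only sort and deduplicate once, when the range is complete:

#include <stdlib.h>

/*
 * Sketch only: accumulate all values for the block range, and do the
 * expensive sorting/deduplication just once at the end, instead of
 * reducing a small buffer over and over while scanning the range.
 */
typedef struct BatchBuffer
{
    double *values;
    int     nvalues;
    int     maxvalues;
} BatchBuffer;

static void
batch_add_value(BatchBuffer *buf, double value)
{
    /* just grow the buffer as needed, no reduction happens here */
    if (buf->nvalues == buf->maxvalues)
    {
        buf->maxvalues = (buf->maxvalues == 0) ? 64 : buf->maxvalues * 2;
        buf->values = realloc(buf->values, sizeof(double) * buf->maxvalues);
    }
    buf->values[buf->nvalues++] = value;
}

static int
double_cmp(const void *a, const void *b)
{
    double  da = *(const double *) a;
    double  db = *(const double *) b;

    return (da > db) - (da < db);
}

/*
 * Called once per block range, at the very end: sort, drop duplicates
 * and return the number of distinct values; these can then be turned
 * into ranges and reduced (e.g. as in the reduce_ranges() sketch above).
 */
static int
batch_finalize(BatchBuffer *buf)
{
    int     ndistinct = 0;

    if (buf->nvalues == 0)
        return 0;

    qsort(buf->values, buf->nvalues, sizeof(double), double_cmp);

    for (int i = 1; i < buf->nvalues; i++)
    {
        if (buf->values[i] != buf->values[ndistinct])
            buf->values[++ndistinct] = buf->values[i];
    }

    return ndistinct + 1;
}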

Note: These parts are fairly new, with minimal testing so far.

When measured on a table with 10M rows, using a number of data sets
with different patterns, the index build results look like this:

dataset                 btree   minmax   unpatched   patched   diff
--------------------------------------------------------------------
monotonic-100-asc        3023     1002        1281      1722   1.34
monotonic-100-desc       3245     1042        1363      1674   1.23
monotonic-10000-asc      2597     1028        2469      2272   0.92
monotonic-10000-desc     2591     1021        2157      2246   1.04
monotonic-asc            1863      968        4884      1106   0.23
monotonic-desc           2546     1017        3520      2337   0.66
random-100               3648     1133        1594      1797   1.13
random-10000             3507     1124        1651      2657   1.61

Here, btree and minmax are the existing index types, unpatched is the
minmax-multi opclass from the previous patch version, and patched is
with 0007 and 0008 applied. The diff column shows patched/unpatched.
The benchmarking script is attached.

The pathological case (monotonic-asc) is now about 4x faster, roughly
on par with a regular minmax index build. The opposite case
(monotonic-desc) is about 33% faster, roughly in line with btree.

There are a couple of cases where it's actually a bit slower - those
are the cases with very few distinct values per range. I believe this
happens because in batch mode the code does not check whether the
summary already contains the value, it just adds it to the buffer, so
the final sort/reduce step ends up being more expensive than checking
up front would have been.

I believe there's a "compromise" between those two extremes, i.e. we
should use a buffer that is neither too small nor too large, but
something in between, so that the reduction happens once in a while but
not too often (as it did with the original aggressive approach).
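
For illustration, building on the BatchBuffer sketch above (and with
REDUCE_FACTOR being a made-up knob, not something from the patch), such
a compromise could trigger the expensive step whenever the buffer grows
past some multiple of the target size:

/*
 * Sketch only: instead of reducing on every overflow of a tiny buffer
 * (original approach) or only once at the very end (batch approach),
 * run the expensive sort/deduplicate step whenever the buffer grows
 * well past the target size, so it's amortized over many insertions.
 */
#define REDUCE_FACTOR   8

static void
buffered_add_value(BatchBuffer *buf, double value, int target_count)
{
    batch_add_value(buf, value);

    if (buf->nvalues >= target_count * REDUCE_FACTOR)
        buf->nvalues = batch_finalize(buf);
}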

FWIW, none of this is likely to be an issue in practice, because (a)
tables usually don't have such strictly monotonic patterns, (b) people
should stick to plain minmax for the cases that do, and (c) regular
tables tend to have much wider rows, so there are fewer values per
range (and the other per-row work is likely more expensive than
building the BRIN index).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-Pass-all-scan-keys-to-BRIN-consistent-funct-20210122.patch text/x-patch 23.5 KB
0002-Move-IS-NOT-NULL-handling-from-BRIN-support-20210122.patch text/x-patch 23.3 KB
0003-Optimize-allocations-in-bringetbitmap-20210122.patch text/x-patch 4.4 KB
0004-BRIN-bloom-indexes-20210122.patch text/x-patch 129.6 KB
0005-BRIN-minmax-multi-indexes-20210122.patch text/x-patch 217.3 KB
0006-Ignore-correlation-for-new-BRIN-opclasses-20210122.patch text/x-patch 4.2 KB
0007-patched-2-20210122.patch text/x-patch 14.9 KB
0008-batch-build-20210122.patch text/x-patch 12.1 KB
brin.sh application/x-shellscript 20.8 KB
