Re: WIP: BRIN multi-range indexes

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: WIP: BRIN multi-range indexes
Date: 2020-11-07 20:38:28
Message-ID: 88b3831d-8afa-4135-a420-27b12bf9fae4@enterprisedb.com
Lists: pgsql-hackers

Hi,

Here's a rebased version of the patch series, to keep the cfbot happy.
I've also restricted the false positive rate to [0.001, 0.1] instead of
the original [0.0, 1.0], per discussion on this thread.

I've done a bunch of experiments, comparing the "regular" bloom indexes
with the one-hashing scheme proposed by John Naylor. I've been wondering
if there's some measurable difference, especially in:

* efficiency (query duration)

* false positive rate depending on the fill factor

So I ran a bunch of tests on synthetic data sets, varying parameters
affecting the BRIN bloom indexes:

1) different pages_per_range

2) number of distinct values per range

3) fill factor of the bloom filter (66%, 100%, 200%)

Attached is the script I used to run the tests, and a simple spreadsheet
summarizing the results for each combination of parameters. For each
combination it shows the average query duration (over 10 runs) and the
scan fraction (what fraction of the table was scanned).

Overall, I think there's very little difference, particularly in the
"match" cases when we're searching for a value that we know is in the
table. The one-hash variant seems to perform a bit better, but the
difference is fairly small.

In the "mismatch" cases (searching for value that is not in the table)
the differences are more significant, but it might be noise. It does
seem much more "green" than "red", i.e. the one-hash variant seems to be
faster (although this does depend on the values for formatting).

To sum this up, I think the one-hash approach seems interesting. It's
not going to give us huge speedups because we're only hashing int32
values anyway (not the source data), but it's worth exploring.
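
For context, this is roughly how I think about the one-hash variant: we
still compute a single 32-bit hash per value (the int32 hash we'd compute
anyway), but the filter is split into nhashes partitions whose lengths
are distinct primes close to nbits/nhashes, and that one hash is reduced
modulo each partition length. This is only a simplified sketch with
made-up names, not the code from the 0005/0006 patches:

#include <stdint.h>
#include <stdbool.h>

/* simplified filter: nhashes partitions laid out back to back */
typedef struct OneHashFilter
{
    int         nhashes;        /* number of partitions */
    uint32_t    partlen[8];     /* prime length (in bits) of each partition */
    uint8_t    *bitmap;         /* sum(partlen) bits, zeroed at init */
} OneHashFilter;

static inline void
set_bit(uint8_t *bitmap, uint64_t bit)
{
    bitmap[bit / 8] |= (uint8_t) (1 << (bit % 8));
}

static inline bool
test_bit(const uint8_t *bitmap, uint64_t bit)
{
    return (bitmap[bit / 8] & (1 << (bit % 8))) != 0;
}

/* add an already-hashed value: one bit per partition, each partition
 * addressed by the same hash reduced modulo its (prime) length */
static void
filter_add(OneHashFilter *f, uint32_t hash)
{
    uint64_t    offset = 0;

    for (int i = 0; i < f->nhashes; i++)
    {
        set_bit(f->bitmap, offset + hash % f->partlen[i]);
        offset += f->partlen[i];
    }
}

/* returns false only if the value is definitely not in the filter */
static bool
filter_test(const OneHashFilter *f, uint32_t hash)
{
    uint64_t    offset = 0;

    for (int i = 0; i < f->nhashes; i++)
    {
        if (!test_bit(f->bitmap, offset + hash % f->partlen[i]))
            return false;
        offset += f->partlen[i];
    }
    return true;
}

The point of picking distinct primes close to nbits/nhashes for the
partition lengths is that the nhashes bit positions stay (nearly)
independent even though they all come from the same 32-bit hash - which
is also why set_bloom_partitions goes looking for consecutive primes
around the target partition length.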

I've started looking at the one-hash code changes, and I've discovered a
couple of issues. I've been wondering how expensive the naive prime sieve
is - it's not an extremely hot code path, as we only run it once per page
range. But still. So my plan was to create the largest bloom filter
possible, and see how much time generate_primes() takes.

So I initialized a cluster with 32kB blocks and tried to do this:

create index on t
using brin (a int4_bloom_ops(n_distinct_per_range=120000,
false_positive_rate=0.1));

which ends up using nbits=575104 (which is 2x the page size, but let's
ignore that) and nhashes=3 - those match the usual bloom sizing formulas,
i.e. m = -n * ln(p) / ln(2)^2 ~ 575104 bits and k = round(ln(2) * m / n)
= 3 for n=120000 and p=0.1. That, however, crashes and burns, because:

a) set_bloom_partitions does this:

while (primes[pidx + nhashes - 1] <= target && primes[pidx] > 0)
    pidx++;

which is broken, because the second part of the condition only checks
the prime at the current index - the loop may then use the nhashes primes
starting at that index, and some of those may be 0 (past the end of the
generated primes). So this needs to be:

while (primes[pidx + nhashes - 1] <= target &&
       primes[pidx + nhashes] > 0)
    pidx++;

(We know there's always at least one 0 at the end, so it's OK not to
check the length explicitly.)
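
Written out as a standalone helper (with the redundant explicit bound
included anyway, just to make the invariant obvious), the fixed lookup
would look something like this - again a sketch with made-up names, not
the patch code:

#include <stdint.h>

/*
 * primes[] is the ascending, zero-terminated array produced by
 * generate_primes(); nprimes is its length excluding the terminator.
 * Advance pidx while the last prime of the candidate window does not
 * exceed the target, but never let the window run past the array.
 */
static int
choose_partition_start(const uint32_t *primes, int nprimes,
                       int nhashes, uint32_t target)
{
    int         pidx = 0;

    while (pidx + nhashes < nprimes &&
           primes[pidx + nhashes] > 0 &&
           primes[pidx + nhashes - 1] <= target)
        pidx++;

    return pidx;
}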

b) set_bloom_partitions does this to generate primes:

/*
 * Increase the limit to ensure we have some primes higher than
 * the target partition length. The 100 value is arbitrary, but
 * should be well over what we need.
 */
primes = generate_primes(target_partlen + 100);

It's not clear to me why 100 is sufficient, particularly for large page
sizes. AFAIK the primes get more and more sparse, so how does this
guarantee we'll get enough "sufficiently large" primes?
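
FWIW one way to sidestep the prime-gap question entirely would be to
simply retry with a larger limit until we actually have enough primes
above the target. A rough sketch (the generate_primes prototype below is
just a stand-in for whatever the patch ends up providing, so treat the
names and types as assumptions):

#include <stdint.h>

/* assumed to behave like the patch's generate_primes(): an ascending,
 * zero-terminated array of all primes up to "limit" */
extern uint32_t *generate_primes(uint32_t limit);

static uint32_t *
primes_covering_target(uint32_t target, int nhashes)
{
    uint32_t    limit = target + 100;

    for (;;)
    {
        uint32_t   *primes = generate_primes(limit);
        int         above = 0;

        /* conservative check: nhashes primes above the target is
         * certainly enough for the partition-selection loop */
        for (int i = 0; primes[i] != 0; i++)
            if (primes[i] > target)
                above++;

        if (above >= nhashes)
            return primes;

        /* not enough headroom - widen the sieve and retry (the
         * too-small array would be freed here in the real code) */
        limit *= 2;
    }
}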

c) generate_primes uses uint16 to store the primes, so it can only
generate primes up to 32768. That's (probably) enough for 8kB pages, but
for 32kB pages it's clearly insufficient.
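
Just to illustrate the kind of change I mean, a sieve that is not limited
to 16-bit values might look something like this (a standalone sketch
using malloc and no error handling; the in-core version would obviously
use palloc and friends):

#include <stdint.h>
#include <stdlib.h>

/*
 * Simple Sieve of Eratosthenes storing the primes as uint32, so it also
 * covers the partition lengths needed for 32kB pages. Returns an
 * ascending, zero-terminated array, like the original generate_primes().
 */
static uint32_t *
generate_primes_u32(uint32_t limit)
{
    char       *composite = calloc((size_t) limit + 1, 1);
    uint32_t   *primes;
    uint32_t    n = 0;

    /* mark multiples of each prime, counting the primes as we go */
    for (uint64_t i = 2; i <= limit; i++)
    {
        if (composite[i])
            continue;
        n++;
        for (uint64_t j = i * i; j <= limit; j += i)
            composite[j] = 1;
    }

    /* collect the primes, with a trailing 0 as the terminator */
    primes = malloc(sizeof(uint32_t) * (n + 1));
    n = 0;
    for (uint64_t i = 2; i <= limit; i++)
        if (!composite[i])
            primes[n++] = (uint32_t) i;
    primes[n] = 0;

    free(composite);
    return primes;
}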

I've fixed these issues in a separate WIP patch, with some simple
debugging logging.

As for the original question of how expensive this naive sieve is, I
haven't been able to measure any significant timings. The logging around
generate_primes usually looks like this:

2020-11-07 20:36:10.614 CET [232789] LOG: generating primes nbits
575104 nhashes 3 target_partlen 191701
2020-11-07 20:36:10.614 CET [232789] LOG: primes generated

So it takes 0.000 seconds even for this extreme page size (the two
timestamps are identical). I don't think we need to invent anything more
elaborate.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
i5.ods application/vnd.oasis.opendocument.spreadsheet 95.4 KB
0001-Pass-all-scan-keys-to-BRIN-consistent-funct-20201107.patch text/x-patch 23.5 KB
0002-Move-IS-NOT-NULL-handling-from-BRIN-support-20201107.patch text/x-patch 23.3 KB
0003-Optimize-allocations-in-bringetbitmap-20201107.patch text/x-patch 4.4 KB
0004-BRIN-bloom-indexes-20201107.patch text/x-patch 138.9 KB
0005-use-one-hash-bloom-variant-20201107.patch text/x-patch 8.5 KB
0006-one-hash-tweaks-20201107.patch text/x-patch 2.6 KB
0007-BRIN-minmax-multi-indexes-20201107.patch text/x-patch 217.3 KB
0008-Ignore-correlation-for-new-BRIN-opclasses-20201107.patch text/x-patch 4.2 KB
bloom-test.sh application/x-shellscript 7.4 KB
