Re: WIP: BRIN multi-range indexes

From: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: BRIN multi-range indexes
Date: 2020-09-21 17:42:34
Message-ID: CACPNZCsnUJjGbC6pMWj0CZF9w2rRcjjohX+6X2xU5+nQesiJ8w@mail.gmail.com

On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

> But maybe we could still use this scheme by actually computing
>
> h1 = hash_uint32_extended(value, seed1);
> h2 = hash_uint32_extended(value, seed2);
>
> and then use this as the independent hash functions. I think that would
> meet the requirements of the paper.

Yeah, that would work algorithmically, and it would be trivial to add to
the patch, too. There'd be a higher up-front CPU cost. Also, I'm a bit
cautious about rehashing hashes, and about whether the two values above
are independent enough. I'm not sure either of these points matters in
practice. My guess is that the partition approach is more sound, but it
has some minor organizational challenges (see below).
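
To spell that out (just a sketch, not the patch's code -- the function
name and seed constants here are invented), I think the scheme would look
something like this, with the two seeded hashes combined in the usual
Kirsch & Mitzenmacher way to get the k probe positions:

#include "postgres.h"
#include "common/hashfn.h"

/* arbitrary example seeds, not from the patch */
#define BLOOM_SEED_1  0x71d924af
#define BLOOM_SEED_2  0xba48b314

static void
bloom_probe_positions(uint32 value, int k, uint32 nbits, uint32 *positions)
{
    uint64  h1 = DatumGetUInt64(hash_uint32_extended(value, BLOOM_SEED_1));
    uint64  h2 = DatumGetUInt64(hash_uint32_extended(value, BLOOM_SEED_2));
    int     i;

    /* i-th "independent" hash is h1 + i * h2, reduced to the filter size */
    for (i = 0; i < k; i++)
        positions[i] = (uint32) ((h1 + (uint64) i * h2) % nbits);
}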

> Why would 11k bits be more than we'd like to store? Assuming we could
> use the whole 8kB page for the bloom filter, that'd be about 64k bits.
> In practice there'd be a bit of overhead (page header ...) but it's
> still much more than 11k bits.

Brain fade -- I guess I thought we were avoiding being toasted, but
now I see that's not possible for BRIN storage. So, we'll want to
guard against this:

ERROR: index row size 8160 exceeds maximum 8152 for index "bar_num_idx"

While playing around with the numbers I had an epiphany: at the
defaults, the filter already takes up ~4.3kB, over half the page.
There is no room for another tuple, so if we're only indexing one
column, we might as well take up the whole page. Here MT = the maximum
number of tuples in 128 8kB pages, or 37120, so the default ndistinct
is MT/10 = 3712.

n      k   m      p
MT/10  7   35580  0.01
MT/10  7   64000  0.0005
MT/10  12  64000  0.00025

Assuming ndistinct isn't way off from reality, we get a 20x-40x lower
false positive rate almost for free, and it'd be trivial to code! Keeping
k at 7 would be more robust, since it's equivalent to starting with
n = ~6000, p = 0.006, which is still almost half the false positive rate
you asked for. It also means nearly doubling the number of sorted
values before switching.
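
For anyone who wants to double-check those numbers, they fall straight
out of the textbook formulas. This is a standalone sanity check, not code
from the patch:

#include <math.h>
#include <stdio.h>

/*
 * Textbook bloom filter math:
 *   m = -n * ln(p) / (ln 2)^2       bits needed for n values at fpr p
 *   k = (m / n) * ln 2              optimal number of hash functions
 *   p ~ (1 - exp(-k * n / m))^k     fpr actually achieved
 */
int
main(void)
{
    double  n = 3712;       /* MT / 10 */
    double  m = ceil(-n * log(0.01) / pow(log(2), 2));

    printf("m = %.0f, k = %.0f\n", m, round(m / n * log(2)));   /* 35580, 7 */

    /* use the whole page (~64000 bits) instead and recompute the fpr */
    m = 64000;
    printf("k = 7:  p = %g\n", pow(1 - exp(-7 * n / m), 7));    /* ~0.0005 */
    printf("k = 12: p = %g\n", pow(1 - exp(-12 * n / m), 12));  /* ~0.00025 */
    return 0;
}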

Going the other direction, if we cap nbits at 64k bits when ndistinct
gets too high, the best false positive rate we can actually support
starts to degrade. Here, the user requested an fpr of 0.001.

n      k  p
4500   9  0.001
6000   7  0.006
7500   6  0.017
15000  3  0.129  (probably useless by now)
MT     1  0.440
64000  1  0.63   (possible with > 128 pages per range)
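
(For example, the n = 15000 row is just the same math with m pinned at
64000 bits: the best k is round(64000/15000 * ln 2) = 3, and
(1 - e^(-3*15000/64000))^3 works out to about 0.13.)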

I imagine smaller pages_per_range settings are going to be useful for
skinny tables (note to self -- test). Maybe we could provide a way for
the user to check that their combination of pages_per_range, false
positive rate, and ndistinct is supportable, something like
brin_bloom_get_supported_fpr(). Or document that they can check with
pageinspect. And that's not even considering multi-column indexes,
like you mentioned.

> But I guess we can simply make the table
> of primes a bit larger, right?

If we want to support all the above cases without falling down
entirely, the table of primes would have to go up to 32k to be safe
(when k = 1 we could degenerate to one modulus on the whole filter).
That's roughly 3,500 primes at two bytes each, so a table of about 7kB,
which we could binary search. [thinks for a moment] ...Actually, if we
wanted to be clever, maybe we could precalculate just the primes needed
for the 64k-bit cases and stick them at the end of the array, where the
usual algorithm will still find them. That way, we could keep the array
around 2kB. However, for block sizes above 8kB we couldn't adjust the
64k number, which might be okay, but isn't really project policy.

We could also generate the primes with a sieve instead, which is really
fast (but more code). That would be more robust, but it would require
the filter to store the actual primes used -- 20 more bytes at the
maximum k = 10. We could hard-code space for that, or, to avoid
hard-coding a maximum k (and thus a lowest possible false positive
rate), we'd need more bookkeeping.
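
A sketch of what I mean (names invented; this variant only takes primes
at or below m/k, so the sum always lands at or below the target -- the
real thing might want to center the primes around m/k instead):

#include "postgres.h"

static void
choose_partition_lengths(uint32 m, int k, uint32 *lengths)
{
    uint32  target = m / k;
    bool   *composite = palloc0((target + 1) * sizeof(bool));
    uint32  i, j;
    int     found = 0;

    /* plain sieve of Eratosthenes up to m/k */
    for (i = 2; i * i <= target; i++)
        if (!composite[i])
            for (j = i * i; j <= target; j += i)
                composite[j] = true;

    /* walk down from m/k, collecting the k largest primes */
    for (i = target; i >= 2 && found < k; i--)
        if (!composite[i])
            lengths[found++] = i;

    Assert(found == k);
    pfree(composite);
}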

So, with the partition approach, we'd likely have to set in stone either
the maximum nbits or the minimum target false positive rate. The latter
option seems more principled, not only because of the block size issue,
but also because the target fp rate is already fixed by the reloption,
and, as I've demonstrated, we can often go above and beyond the
reloption even without changing k.

> >One wrinkle is that the sum of k primes is not going to match m
> >exactly. If the sum is too small, we can trim a few bits off of the
> >filter bitmap. If the sum is too large, the last partition can spill
> >into the front of the first one. This shouldn't matter much in the
> >common case since we need to round m to the nearest byte anyway.
> >
>
> AFAIK the paper simply says that as long as the sum of partitions is
> close to the requested nbits, it's good enough. So I guess we could just
> roll with that, no need to trim/wrap or something like that.

Hmm, I'm not sure I understand you. I can see not caring to trim wasted
bits, but we can't set or read bits off the end of the filter. If we
don't wrap, we could just skip reading/writing that bit, so a tiny
fraction of accesses would effectively use only k - 1 hash functions.
The paper is not clear on what to do here, but it is explicit about
minimizing the absolute difference between the sum of the partition
lengths and m, and that difference can land on either side.
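
To be concrete about the wrap case, I was thinking of something like
this (sketch only; the names and layout are invented): offsets[] holds
the start of each partition, lengths[] the prime sizes, and nbits the
allocated bitmap size, which may be a bit smaller than the sum of the
lengths.

static uint32
partition_bit_position(uint64 hash, int i, const uint32 *offsets,
                       const uint32 *lengths, uint32 nbits)
{
    uint32  pos = offsets[i] + (uint32) (hash % lengths[i]);

    /* the last partition may spill past nbits; wrap into the first one */
    if (pos >= nbits)
        pos -= nbits;

    return pos;
}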

Also, I found a bug:

+ add_local_real_reloption(relopts, "false_positive_rate",
+ "desired false-positive rate for the bloom filters",
+ BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
+ 0.001, 1.0, offsetof(BloomOptions, falsePositiveRate));

When I set the false positive rate to 1.0, the reloption code is okay
with that, but an assertion gets triggered later on.
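
Probably the fix is just to exclude 1.0 (and maybe absurdly small
values) from the allowed reloption range, roughly like this -- the
min/max constant names here are invented:

add_local_real_reloption(relopts, "false_positive_rate",
                         "desired false-positive rate for the bloom filters",
                         BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
                         BLOOM_MIN_FALSE_POSITIVE_RATE,   /* e.g. 0.001 */
                         BLOOM_MAX_FALSE_POSITIVE_RATE,   /* something < 1.0 */
                         offsetof(BloomOptions, falsePositiveRate));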

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
