Re: WIP: BRIN multi-range indexes

From: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: BRIN multi-range indexes
Date: 2020-09-24 21:18:03
Message-ID: CACPNZCvvq9W8tbE4PQbyPJyLXWnJCPTJ099LTWrbvb5NDs86qg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:

> >While playing around with the numbers I had an epiphany: At the
> >defaults, the filter already takes up ~4.3kB, over half the page.
> >There is no room for another tuple, so if we're only indexing one
> >column, we might as well take up the whole page.
>
> Hmm, yeah. I may be wrong but IIRC indexes don't support external
> storage but compression is still allowed. So even if those defaults are
> a bit higher than needed that should make the bloom filters a bit more
> compressible, and thus fit multiple BRIN tuples on a single page.

> Not sure about how much we want to rely on these optimizations, though,
> considering multi-column indexes kinda break this.

Yeah. Okay, then it sounds like we should go in the other direction,
as the block comment at the top of brin_bloom.c implies. Indexes with
multiple bloom-indexed columns already don't fit in one 8kB page, so I
think every documented example should have a much lower
pages_per_range. Using 32 pages per range with max tuples gives n =
928. With default p, that's about 1.1 kB per brin tuple, so one brin
page can index 224 pages, much more than with the default 128.

Hmm, how ugly would it be to change the default range size depending
on the opclass?

If indexes don't support external storage, that sounds like a pain to
add. Also, with very small fpr, you can easily get into many megabytes
of filter space, which kind of defeats the purpose of brin in the
first place.

There is already this item from the brin readme:

* Different-size page ranges?
In the current design, each "index entry" in a BRIN index covers the same
number of pages. There's no hard reason for this; it might make sense to
allow the index to self-tune so that some index entries cover smaller page
ranges, if this allows the summary values to be more compact. This would incur
larger BRIN overhead for the index itself, but might allow better pruning of
page ranges during scan. In the limit of one index tuple per page, the index
itself would occupy too much space, even though we would be able to skip
reading most heap pages, because the summary values are tight; in the
opposite limit of a single tuple that summarizes the whole table, we wouldn't
be able to prune anything even though the index is very small. This can
probably be made to work by using the range map as an index in itself.

This sounds like a lot of work, but would be robust.

Anyway, given that this is a general problem and not specific to the
prime partition algorithm, I'll leave that out of the attached patch,
named as a .txt to avoid confusing the cfbot.

> >We could also generate the primes via a sieve instead, which is really
> >fast (and more code). That would be more robust, but that would require
> >the filter to store the actual primes used, so 20 more bytes at max k =
> >10. We could hard-code space for that, or to keep from hard-coding
> >maximum k and thus lowest possible false positive rate, we'd need more
> >bookkeeping.
> >
>
> I don't think the efficiency of this code matters too much - it's only
> used once when creating the index, so the simpler the better. Certainly
> for now, while testing the partitioning approach.

To check my understanding, isn't bloom_init() called for every tuple?
Agreed on simplicity so done this way.

> >So, with the partition approach, we'd likely have to set in stone
> >either max nbits, or min target false positive rate. The latter option
> >seems more principled, not only for the block size, but also since the
> >target fp rate is already fixed by the reloption, and as I've
> >demonstrated, we can often go above and beyond the reloption even
> >without changing k.
> >
>
> That seems like a rather annoying limitation, TBH.

I don't think the latter is that bad. I've capped k at 10 for
demonstration's sake:

(928 is from using 32 pages per range)

n k m p
928 7 8895 0.01
928 10 13343 0.001 (lowest p supported in patch set)
928 13 17790 0.0001
928 10 18280 0.0001 (still works with lower k, needs higher m)
928 10 17790 0.00012 (even keeping m from row #3, capping k doesn't degrade p much)

Also, k seems pretty robust against small changes as long as m isn't
artificially constrained and as long as p is small.

So I *think* it's okay to cap k at 10 or 12, and not bother with
adjusting m, which worsens space issues. As I found before, lowering k
raises target fpr, but seems more robust to overshooting ndistinct. In
any case, we only need k * 2 bytes to store the partition lengths.

The only way I see to avoid any limitation is to make the array of
primes variable length, which could be done by putting the filter
offset calculation into a macro. But having two variable-length arrays
seems messy.

> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
> >bits, but we can't set/read off the end of the filter. If we don't
> >wrap, we could just skip reading/writing that bit. So a tiny portion of
> >access would be at k - 1. The paper is not clear on what to do here,
> >but they are explicit in minimizing the absolute value, which could go
> >on either side.
> >
>
> What I meant is that is that the paper says this:
>
> Given a planned overall length mp for a Bloom filter, we usually
> cannot get k prime numbers to make their sum mf to be exactly mp. As
> long as the difference between mp and mf is small enough, it neither
> causes any trouble for the software implementation nor noticeably
> shifts the false positive ratio.
>
> Which I think means we can pick mp, generate k primes with sum mf close
> to mp, and just use that with mf bits.

Oh, I see. When I said "trim" I meant exactly that (when mf < mp).
Yeah, we can bump it up as well for the other case. I've done it that
way.

> >+ add_local_real_reloption(relopts, "false_positive_rate",
> >+ "desired false-positive rate for the bloom filters",
> >+ BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
> >+ 0.001, 1.0, offsetof(BloomOptions, falsePositiveRate));
> >
> >When I set fp to 1.0, the reloption code is okay with that, but then
> >later the assertion gets triggered.
> >
>
> Hmm, yeah. I wonder what to do about that, considering indexes with fp
> 1.0 are essentially useless.

Not just useless -- they're degenerate: when p = 1.0, m = k = 0, so we
cannot accept this value from the user. Looking up thread, 0.1 was
suggested as a limit. That might be a good starting point.

This is interesting work! Having gone this far, I'm going to put more
attention to the multi-minmax patch and actually start performance
testing.

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
one-hash-bloom-filter-20200917b.txt text/plain 8.1 KB
