Re: WIP: BRIN multi-range indexes

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: BRIN multi-range indexes
Date: 2020-09-24 23:50:45
Message-ID: 20200924235045.bgxsilvf7askukph@development
Lists: pgsql-hackers

On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:
>On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:
>
>> >While playing around with the numbers I had an epiphany: At the
>> >defaults, the filter already takes up ~4.3kB, over half the page.
>> >There is no room for another tuple, so if we're only indexing one
>> >column, we might as well take up the whole page.
>>
>> Hmm, yeah. I may be wrong, but IIRC indexes don't support external
>> storage, though compression is still allowed. So even if those defaults
>> are a bit higher than needed, that should make the bloom filters a bit
>> more compressible, and thus fit multiple BRIN tuples on a single page.
>
>> Not sure about how much we want to rely on these optimizations, though,
>> considering multi-column indexes kinda break this.
>
>Yeah. Okay, then it sounds like we should go in the other direction,
>as the block comment at the top of brin_bloom.c implies. Indexes with
>multiple bloom-indexed columns already don't fit in one 8kB page, so I
>think every documented example should have a much lower
>pages_per_range. Using 32 pages per range with max tuples gives n =
>928. With default p, that's about 1.1 kB per brin tuple, so one brin
>page can index 224 pages, much more than with the default 128.
>
>Hmm, how ugly would it be to change the default range size depending
>on the opclass?
>

Not sure. What would happen for multi-column BRIN indexes with different
opclasses?
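As a sanity check on the arithmetic quoted above, here's a quick sketch in
Python using the textbook bloom formulas m = -n ln(p) / (ln 2)^2 and
k = (m/n) ln 2 (the helper name is made up, just for illustration):

```python
import math

def bloom_params(n, p):
    # Textbook sizing: bits m and hash count k for n elements
    # at target false-positive rate p.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k

m, k = bloom_params(928, 0.01)   # n = 928 from 32 pages per range
print(m, k)                      # 8895 bits (~1.1 kB), k = 7
print(8192 // (m // 8) * 32)     # pages indexable by one 8kB BRIN page -> 224
```

which matches the ~1.1 kB per BRIN tuple and 224 pages per BRIN page
mentioned above.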

>If indexes don't support external storage, that sounds like a pain to
>add. Also, with very small fpr, you can easily get into many megabytes
>of filter space, which kind of defeats the purpose of brin in the
>first place.
>

True.

>There is already this item from the brin readme:
>
>* Different-size page ranges?
> In the current design, each "index entry" in a BRIN index covers the same
> number of pages. There's no hard reason for this; it might make sense to
> allow the index to self-tune so that some index entries cover smaller page
> ranges, if this allows the summary values to be more compact. This would incur
> larger BRIN overhead for the index itself, but might allow better pruning of
> page ranges during scan. In the limit of one index tuple per page, the index
> itself would occupy too much space, even though we would be able to skip
> reading the most heap pages, because the summary values are tight; in the
> opposite limit of a single tuple that summarizes the whole table, we wouldn't
> be able to prune anything even though the index is very small. This can
> probably be made to work by using the range map as an index in itself.
>
>This sounds like a lot of work, but would be robust.
>

Yeah. I think it's a fairly independent / orthogonal project.

>Anyway, given that this is a general problem and not specific to the
>prime partition algorithm, I'll leave that out of the attached patch,
>named as a .txt to avoid confusing the cfbot.
>
>> >We could also generate the primes via a sieve instead, which is really
>> >fast (and more code). That would be more robust, but that would require
>> >the filter to store the actual primes used, so 20 more bytes at max k =
>> >10. We could hard-code space for that, or to keep from hard-coding
>> >maximum k and thus lowest possible false positive rate, we'd need more
>> >bookkeeping.
>> >
>>
>> I don't think the efficiency of this code matters too much - it's only
>> used once when creating the index, so the simpler the better. Certainly
>> for now, while testing the partitioning approach.
>
>To check my understanding, isn't bloom_init() called for every tuple?
>Agreed on simplicity so done this way.
>

No, it's only called for the first non-NULL value in the page range
(unless I made a boo boo when writing that code).

>> >So, with the partition approach, we'd likely have to set in stone
>> >either max nbits, or min target false positive rate. The latter option
>> >seems more principled, not only for the block size, but also since the
>> >target fp rate is already fixed by the reloption, and as I've
>> >demonstrated, we can often go above and beyond the reloption even
>> >without changing k.
>> >
>>
>> That seems like a rather annoying limitation, TBH.
>
>I don't think the latter is that bad. I've capped k at 10 for
>demonstration's sake:
>
>(928 is from using 32 pages per range)
>
>n    k   m      p
>928  7   8895   0.01
>928  10  13343  0.001    (lowest p supported in patch set)
>928  13  17790  0.0001
>928  10  18280  0.0001   (still works with lower k, needs higher m)
>928  10  17790  0.00012  (even keeping m from row #3, capping k doesn't degrade p much)
>
>Also, k seems pretty robust against small changes as long as m isn't
>artificially constrained and as long as p is small.
>
>So I *think* it's okay to cap k at 10 or 12, and not bother with
>adjusting m, which worsens space issues. As I found before, lowering k
>raises target fpr, but seems more robust to overshooting ndistinct. In
>any case, we only need k * 2 bytes to store the partition lengths.
>
>The only way I see to avoid any limitation is to make the array of
>primes variable length, which could be done by putting the filter
>offset calculation into a macro. But having two variable-length arrays
>seems messy.
>

Hmmm. I wonder how these limitations would impact the conclusions from
the one-hashing paper? Or was this just for the sake of a demonstration?

I'd suggest we just do the simplest thing possible (be it a hard-coded
table of primes or a sieve) and then evaluate if we need to do something
more sophisticated.
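Incidentally, the k/m/p table upthread is consistent with the usual
approximation p ≈ (1 - e^(-kn/m))^k; a quick Python cross-check, purely
for illustration:

```python
import math

def fpr(n, k, m):
    # Approximate bloom false-positive rate for n elements,
    # k hash functions, m bits.
    return (1 - math.exp(-k * n / m)) ** k

# Rows from the table upthread (n = 928 in all cases); the computed
# values land close to the p column.
for k, m in [(7, 8895), (10, 13343), (13, 17790), (10, 18280), (10, 17790)]:
    print(k, m, fpr(928, k, m))
```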

>> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
>> >bits, but we can't set/read off the end of the filter. If we don't
>> >wrap, we could just skip reading/writing that bit. So a tiny portion of
>> >access would be at k - 1. The paper is not clear on what to do here,
>> >but they are explicit in minimizing the absolute value, which could go
>> >on either side.
>> >
>>
>> What I meant is that the paper says this:
>>
>> Given a planned overall length mp for a Bloom filter, we usually
>> cannot get k prime numbers to make their sum mf to be exactly mp. As
>> long as the difference between mp and mf is small enough, it neither
>> causes any trouble for the software implementation nor noticeably
>> shifts the false positive ratio.
>>
>> Which I think means we can pick mp, generate k primes with sum mf close
>> to mp, and just use that with mf bits.
>
>Oh, I see. When I said "trim" I meant exactly that (when mf < mp).
>Yeah, we can bump it up as well for the other case. I've done it that
>way.
>

OK
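FWIW, picking k primes whose sum mf lands near a target mp is simple
enough to sketch. The following Python (helper names invented; a sketch
of the idea from the paper, not the patch's implementation) sieves primes
and scans windows of k consecutive primes around mp/k:

```python
def primes_up_to(limit):
    # Simple sieve of Eratosthenes.
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            is_prime[i * i :: i] = [False] * len(is_prime[i * i :: i])
    return [i for i, p in enumerate(is_prime) if p]

def partition_lengths(mp, k):
    # Pick k distinct primes near mp / k whose sum mf is close to mp.
    primes = primes_up_to(2 * mp // k)
    best = None
    for i in range(len(primes) - k + 1):
        window = primes[i : i + k]
        if best is None or abs(sum(window) - mp) < abs(sum(best) - mp):
            best = window
    return best

parts = partition_lengths(8895, 7)
print(parts, sum(parts))   # sum lands within a few dozen bits of 8895
```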

>> >+ add_local_real_reloption(relopts, "false_positive_rate",
>> >+                          "desired false-positive rate for the bloom filters",
>> >+                          BLOOM_DEFAULT_FALSE_POSITIVE_RATE,
>> >+                          0.001, 1.0, offsetof(BloomOptions, falsePositiveRate));
>> >
>> >When I set fp to 1.0, the reloption code is okay with that, but then
>> >later the assertion gets triggered.
>> >
>>
>> Hmm, yeah. I wonder what to do about that, considering indexes with fp
>> 1.0 are essentially useless.
>
>Not just useless -- they're degenerate: when p = 1.0, m = k = 0, so we
>cannot accept this value from the user. Looking upthread, 0.1 was
>suggested as a limit. That might be a good starting point.
>

Makes sense, I'll fix it that way.
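For the record, the degeneracy at p = 1.0 falls straight out of the
sizing formula; a small Python sketch (function name made up):

```python
import math

def bloom_size(n, p):
    # m = -n ln(p) / (ln 2)^2; at p = 1.0 this collapses to zero,
    # and with it the hash count k.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k

print(bloom_size(928, 0.1))   # still a usable filter
print(bloom_size(928, 1.0))   # (0, 0): no bits, no hashes
```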

>This is interesting work! Having gone this far, I'm going to put more
>attention to the multi-minmax patch and actually start performance
>testing.
>

Cool, thanks! I'll take a look at your one-hashing patch tomorrow.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
