From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|

To: | John Naylor <john(dot)naylor(at)2ndquadrant(dot)com> |

Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |

Subject: | Re: WIP: BRIN multi-range indexes |

Date: | 2020-09-21 19:56:07 |

Message-ID: | 20200921195607.gt6xbia7v55bhuwq@development |

Views: | Raw Message | Whole Thread | Download mbox | Resend email |

Thread: | |

Lists: | pgsql-hackers |

On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:

>On Fri, Sep 18, 2020 at 6:27 PM Tomas Vondra

><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

>

>> But maybe we could still use this scheme by actually computing

>>

>> h1 = hash_uint32_extended(value, seed1);

>> h2 = hash_uint32_extended(value, seed2);

>>

>> and then use this as the independent hash functions. I think that would

>> meet the requirements of the paper.

>

>Yeah, that would work algorithmically. It would be trivial to add to

>the patch, too of course. There'd be a higher up-front cpu cost. Also,

>I'm a bit cautious of rehashing hashes, and whether the two values

>above are independent enough. I'm not sure either of these points

>matters. My guess is the partition approach is more sound, but it has

>some minor organizational challenges (see below).

>

OK. I don't think rehashing hashes is an issue as long as the original

hash has sufficiently low collision rate (and while we know it's not

perfect we know it works well enough for hash indexes etc.). And I doubt

the cost of the extra hash of uint32 would be noticeable.

That being said the partitioning approach might be more sound and it's

definitely worth giving it a try.

>> Why would 11k bits be more than we'd like to store? Assuming we could

>> use the whole 8kB page for the bloom filter, that'd be about 64k bits.

>> In practice there'd be a bit of overhead (page header ...) but it's

>> still much more than 11k bits.

>

>Brain fade -- I guess I thought we were avoiding being toasted, but

>now I see that's not possible for BRIN storage. So, we'll want to

>guard against this:

>

>ERROR: index row size 8160 exceeds maximum 8152 for index "bar_num_idx"

>

>While playing around with the numbers I had an epiphany: At the

>defaults, the filter already takes up ~4.3kB, over half the page.

>There is no room for another tuple, so if we're only indexing one

>column, we might as well take up the whole page.

Hmm, yeah. I may be wrong but IIRC indexes don't support external

storage but compression is still allowed. So even if those defaults are

a bit higher than needed that should make the bloom filters a bit more

compressible, and thus fit multiple BRIN tuples on a single page.

>Here MT = max tuples per 128 8k pages, or 37120, so default ndistinct

>is 3712.

>

>n k m p

>MT/10 7 35580 0.01

>MT/10 7 64000 0.0005

>MT/10 12 64000 0.00025

>

>Assuming ndistinct isn't way off reality, we get 20x-40x lower false

>positive rate almost for free, and it'd be trivial to code! Keeping k

>at 7 would be more robust, since it's equivalent to starting with n =

>~6000, p = 0.006, which is still almost 2x less false positives than

>you asked for. It also means nearly doubling the number of sorted

>values before switching.

>

>Going the other direction, capping nbits to 64k bits when ndistinct

>gets too high, the false positive rate we can actually support starts

>to drop. Here, the user requested 0.001 fpr.

>

>n k p

>4500 9 0.001

>6000 7 0.006

>7500 6 0.017

>15000 3 0.129 (probably useless by now)

>MT 1 0.440

>64000 1 0.63 (possible with > 128 pages per range)

>

>I imagine smaller pages_per_range settings are going to be useful for

>skinny tables (note to self -- test). Maybe we could provide a way for

>the user to see that their combination of pages_per_range, false

>positive rate, and ndistinct is supportable, like

>brin_bloom_get_supported_fpr(). Or document to check with page_inspect.

>And that's not even considering multi-column indexes, like you

>mentioned.

>

I agree giving users visibility into this would be useful.

Not sure about how much we want to rely on these optimizations, though,

considering multi-column indexes kinda break this.

>> But I guess we can simply make the table of primes a bit larger,

>> right?

>

>If we want to support all the above cases without falling down

>entirely, it would have to go up to 32k to be safe (When k = 1 we could

>degenerate to one modulus on the whole filter). That would be a table

>of about 7kB, which we could binary search. [thinks for a

>moment]...Actually, if we wanted to be clever, maybe we could

>precalculate the primes needed for the 64k bit cases and stick them at

>the end of the array. The usual algorithm will find them. That way, we

>could keep the array around 2kB. However, for >8kB block size, we

>couldn't adjust the 64k number, which might be okay, but not really

>project policy.

>

>We could also generate the primes via a sieve instead, which is really

>fast (and more code). That would be more robust, but that would require

>the filter to store the actual primes used, so 20 more bytes at max k =

>10. We could hard-code space for that, or to keep from hard-coding

>maximum k and thus lowest possible false positive rate, we'd need more

>bookkeeping.

>

I don't think the efficiency of this code matters too much - it's only

used once when creating the index, so the simpler the better. Certainly

for now, while testing the partitioning approach.

>So, with the partition approach, we'd likely have to set in stone

>either max nbits, or min target false positive rate. The latter option

>seems more principled, not only for the block size, but also since the

>target fp rate is already fixed by the reloption, and as I've

>demonstrated, we can often go above and beyond the reloption even

>without changing k.

>

That seems like a rather annoying limitation, TBH.

>> >One wrinkle is that the sum of k primes is not going to match m

>> >exactly. If the sum is too small, we can trim a few bits off of the

>> >filter bitmap. If the sum is too large, the last partition can spill

>> >into the front of the first one. This shouldn't matter much in the

>> >common case since we need to round m to the nearest byte anyway.

>> >

>>

>> AFAIK the paper simply says that as long as the sum of partitions is

>> close to the requested nbits, it's good enough. So I guess we could

>> just roll with that, no need to trim/wrap or something like that.

>

>Hmm, I'm not sure I understand you. I can see not caring to trim wasted

>bits, but we can't set/read off the end of the filter. If we don't

>wrap, we could just skip reading/writing that bit. So a tiny portion of

>access would be at k - 1. The paper is not clear on what to do here,

>but they are explicit in minimizing the absolute value, which could go

>on either side.

>

What I meant is that is that the paper says this:

Given a planned overall length mp for a Bloom filter, we usually

cannot get k prime numbers to make their sum mf to be exactly mp. As

long as the difference between mp and mf is small enough, it neither

causes any trouble for the software implementation nor noticeably

shifts the false positive ratio.

Which I think means we can pick mp, generate k primes with sum mf close

to mp, and just use that with mf bits.

>Also I found a bug:

>

>+ add_local_real_reloption(relopts, "false_positive_rate", + "desired

>false-positive rate for the bloom filters", +

>BLOOM_DEFAULT_FALSE_POSITIVE_RATE, + 0.001, 1.0, offsetof(BloomOptions,

>falsePositiveRate));

>

>When I set fp to 1.0, the reloption code is okay with that, but then

>later the assertion gets triggered.

>

Hmm, yeah. I wonder what to do about that, considering indexes with fp

1.0 are essentially useless.

regards

--

Tomas Vondra http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

- Re: WIP: BRIN multi-range indexes at 2020-09-21 17:42:34 from John Naylor

- Re: WIP: BRIN multi-range indexes at 2020-09-24 21:18:03 from John Naylor

From | Date | Subject | |
---|---|---|---|

Next Message | Andres Freund | 2020-09-21 19:58:39 | Re: Feature improvement for pg_stat_statements |

Previous Message | Andres Freund | 2020-09-21 19:53:11 | Re: Improper use about DatumGetInt32 |