Quick Links

Re: WIP: BRIN multi-range indexes

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: BRIN multi-range indexes
Date:	2020-09-18 22:27:02
Message-ID:	20200918222702.omsieaphfj3ctqg3@development
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Sep 18, 2020 at 05:06:49PM -0400, John Naylor wrote:
>On Fri, Sep 18, 2020 at 7:40 AM Tomas Vondra
><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> On Thu, Sep 17, 2020 at 06:48:11PM -0400, John Naylor wrote:
>> >I wrote:
>> >
>> >> Hmm, I came across that paper while doing background reading. Okay,
>> >> now I get that "% (filter->nbits - 1)" is the second hash function in
>> >> that scheme. But now I wonder if that second function should actually
>> >> act on the passed "value" (the original hash), so that they are
>> >> actually independent, as required. In the language of that paper, the
>> >> patch seems to have
>> >>
>> >> g(x) = h1(x) + i*h2(h1(x)) + f(i)
>> >>
>> >> instead of
>> >>
>> >> g(x) = h1(x) + i*h2(x) + f(i)
>> >>
>> >> Concretely, I'm wondering if it should be:
>> >>
>> >> big_h = DatumGetUint32(hash_uint32(value));
>> >> h = big_h % filter->nbits;
>> >> -d = big_h % (filter->nbits - 1);
>> >> +d = value % (filter->nbits - 1);
>> >>
>> >> But I could be wrong.
>> >
>> >I'm wrong -- if we use different operands to the moduli, we throw away
>> >the assumption of co-primeness. But I'm still left wondering why we
>> >have to re-hash the hash for this to work. In any case, there should
>> >be some more documentation around the core algorithm, so that future
>> >readers are not left scratching their heads.
>> >
>>
>> Hmm, good question. I think we don't really need to hash it twice. It
>> does not rally achieve anything - it won't reduce number of collisions
>> or anything like that.
>
>Yeah, looking back at the discussion you linked previously, I think
>it's a holdover from when the uint32 was rehashed with k different
>seeds. Anyway, after thinking about it some more, I still have doubts
>about the mapping algorithm. There are two stages to a hash mapping --
>hashing and modulus. I don't think a single hash function (whether
>rehashed or not) can be turned into two independent functions via a
>choice of second modulus. At least, that's not what the Kirsch &
>Mitzenmacher paper is claiming. Since we're not actually applying two
>independent hash functions on the scan key, we're kind of shooting in
>the dark.
>

OK. I admit the modulo by nbits and (nbits - 1) is a bit suspicious, so
you may be right this is not quite correct construction.

The current scheme was meant to reduce the number of expensive hashing
calls (especially for low fpr values we may require quite a few of
those, easily 10 or more.

But maybe we could still use this scheme by actually computing

h1 = hash_uint32_extended(value, seed1);
h2 = hash_uint32_extended(value, seed2);

and then use this as the independent hash functions. I think that would
meet the requirements of the paper.

>It turns out there is something called a one-hash bloom filter, and
>the paper in [1] has a straightforward algorithm. Since we can
>implement it exactly as stated in the paper, that gives me more
>confidence in the real-world false positive rate. It goes like this:
>
>Partition the filter bitmap into k partitions of similar but unequal
>length, corresponding to consecutive prime numbers. Use the primes for
>moduli of the uint32 value and map it to the bit of the corresponding
>partition. For a simple example, let's use 7, 11, 13 for partitions in
>a filter of size 31. The three bits are:
>
>value % 7
>7 + (value % 11)
>7 + 11 + (value % 13)
>
>We could store a const array of the first 256 primes. The largest such
>prime is 1613, so with k=7 we can support up to ~11k bits, which is
>more than we'd like to store anyway. Then we store the array index of
>the largest prime in the 8bits of padding we currently have in
>BloomFilter struct.
>

Why would 11k bits be more than we'd like to store? Assuming we could
use the whole 8kB page for the bloom filter, that'd be about 64k bits.
In practice there'd be a bit of overhead (page header ...) but it's
still much more than 11k bits. But I guess we can simply make the table
of primes a bit larger, right?

FWIW I don't think we need to be that careful about the space to store
stuff in padding etc. If we can - great, but compared to the size of the
filter it's negligible and I'd prioritize simplicity over a byte or two.

>One wrinkle is that the sum of k primes is not going to match m
>exactly. If the sum is too small, we can trim a few bits off of the
>filter bitmap. If the sum is too large, the last partition can spill
>into the front of the first one. This shouldn't matter much in the
>common case since we need to round m to the nearest byte anyway.
>

AFAIK the paper simply says that as long as the sum of partitions is
close to the requested nbits, it's good enough. So I guess we could just
roll with that, no need to trim/wrap or something like that.

>This should be pretty straightforward to turn into code and I can take
>a stab at it. Thoughts?
>

Sure, go ahead. I'm happy someone is actually looking at those patches
and proposing alternative solutions, and this might easily be a better
hashing scheme.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Re: WIP: BRIN multi-range indexes at 2020-09-18 21:06:49 from John Naylor

Responses

Re: WIP: BRIN multi-range indexes at 2020-09-21 17:42:34 from John Naylor

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2020-09-18 22:32:52	Re: recovering from "found xmin ... from before relfrozenxid ..."
Previous Message	Bruce Momjian	2020-09-18 22:18:11	Re: Inconsistent Japanese name order in v13 contributors list