Re: [HACKERS] A design for amcheck heapam verification

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] A design for amcheck heapam verification
Date: 2018-03-30 02:42:42
Message-ID: CAH2-Wzn11fiPn6Pqhuk9qqyMP7VZh+ePaOhoRT13c2B+NRjXuw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 29, 2018 at 6:18 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> This commit adds a test harness extension module, test_bloomfilter. It
>> can be used to get a sense of how the Bloom filter implementation
>> performs under varying conditions.
>
> Maybe add a short paragraph explaining what this'll be used for soon.

Sure.

>> @@ -12,7 +12,7 @@ subdir = src/backend/lib
>> top_builddir = ../../..
>> include $(top_builddir)/src/Makefile.global
>>
>> -OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
>> - knapsack.o pairingheap.o rbtree.o stringinfo.o
>> +OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
>> + ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
>
> *NOT* for this patch: I really wonder whether we should move towards a
> style where there's only ever a single object per-line. Would make
> things like this easier to view and conflicts easier to resolve.

That does seem like it would be a minor improvement.

>> --- /dev/null
>> +++ b/src/backend/lib/bloomfilter.c
>> @@ -0,0 +1,304 @@
>> +/*-------------------------------------------------------------------------
>> + *
>> + * bloomfilter.c
>> + * Space-efficient set membership testing
>> + *
>> + * A Bloom filter is a probabilistic data structure that is used to test an
>> + * element's membership of a set.
>
> s/of/in/?

Wikipedia says "A Bloom filter is a space-efficient probabilistic data
structure, conceived by Burton Howard Bloom in 1970, that is used to
test whether an element is a member of a set". I think that either
works just as well.

>> False positives are possible, but false
>> + * negatives are not; a test of membership of the set returns either "possibly
>> + * in set" or "definitely not in set". This can be very space efficient when
>> + * individual elements are larger than a few bytes, because elements are hashed
>> + * in order to set bits in the Bloom filter bitset.
>
> The second half of this paragraph isn't easy to understand.

I'll tweak it.

>> + * Elements can be added to the set, but not removed. The more elements that
>> + * are added, the larger the probability of false positives. Caller must hint
>> + * an estimated total size of the set when its Bloom filter is initialized.
>> + * This is used to balance the use of memory against the final false positive
>> + * rate.
>
> s/its Bloom/the Bloom/?

Okay. I'll give you that one.

>> + * The implementation is well suited to data synchronization problems between
>> + * unordered sets, especially where predictable performance is important and
>> + * some false positives are acceptable.
>
> I'm not finding "data synchronization" very descriptive. Makes me think
> of multi-threaded races and such.

Again, this is from Wikipedia:

https://en.wikipedia.org/wiki/Bloom_filter#Data_synchronization

https://en.wikipedia.org/wiki/Data_synchronization

>> +/*
>> + * Create Bloom filter in caller's memory context. This should get a false
>> + * positive rate of between 1% and 2% when bitset is not constrained by memory.
>
> s/should/aims at/?

I think that "should" is accurate, and no less informative. I'm not
going to argue with you, though -- I'll change it.

>> + * total_elems is an estimate of the final size of the set. It ought to be
>> + * approximately correct, but we can cope well with it being off by perhaps a
>> + * factor of five or more. See "Bloom Filters in Probabilistic Verification"
>> + * (Dillinger & Manolios, 2004) for details of why this is the case.
>
> I'd simplify the language here. I'd replace ought with should at the
> very least. Replace we with "the bloom filter" or similar?

I don't see what's wrong with "ought", but okay. I don't see what's
wrong with "we", but okay.

>> + * bloom_work_mem is sized in KB, in line with the general work_mem convention.
>> + * This determines the size of the underlying bitset (trivial bookkeeping space
>> + * isn't counted). The bitset is always sized as a power-of-two number of
>> + * bits, and the largest possible bitset is 512MB. The implementation rounds
>> + * down as needed.
>
> "as needed" should be expanded. Just say ~"Only the required amount of
> memory is allocated"?

Okay.

>> +bloom_filter *
>> +bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
>> +{
>> + bloom_filter *filter;
>> + int bloom_power;
>> + uint64 bitset_bytes;
>> + uint64 bitset_bits;
>> +
>> + /*
>> + * Aim for two bytes per element; this is sufficient to get a false
>> + * positive rate below 1%, independent of the size of the bitset or total
>> + * number of elements. Also, if rounding down the size of the bitset to
>> + * the next lowest power of two turns out to be a significant drop, the
>> + * false positive rate still won't exceed 2% in almost all cases.
>> + */
>> + bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
>> + /* Minimum allowable size is 1MB */
>> + bitset_bytes = Max(1024L * 1024L, bitset_bytes);
>
> Some upcasting might be advisable, to avoid dangers of overflows?

When it comes to sizing work_mem, using long literals to go from KB to
bytes is How It's Done™. I actually think that's silly myself, because
it's based on the assumption that long is wider than int, even though
it isn't on Windows. But that's okay because we have the old work_mem
size limits on Windows.

What would the upcasting you have in mind look like?

>> + /* Use 64-bit hashing to get two independent 32-bit hashes */
>> + hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));
>
> Hm. Is that smart given how some hash functions are defined? E.g. for
> int8 the higher bits aren't really that independent for small values:

Robert suggested that I do this. I don't think that we need to make it
about the quality of the hash function that we have available. That
really seems like a distinct question to me. It seems clear that this
ought to be fine (or should be fine, if you prefer). I understand why
you're asking about this, but it's not scalable to ask every user of a
hash function to care that it might be a bit crap. Hash functions
aren't supposed to be a bit crap.

You may be nervous about the overall quality of the Bloom filter,
which is understandable. Check out the test harness, and how it can be
used [1]. This shows that the theoretical/expected false positive rate
[2] is consistently achieved with random seed values, even as the size
of the Bloom filter is varied, right up until the largest supported
size (512MB). The margin of error is tiny - certainly much less than a
practical use-case could ever care about.

>> +/*
>> + * Calculate "val MOD m" inexpensively.
>> + *
>> + * Assumes that m (which is bitset size) is a power-of-two.
>> + *
>> + * Using a power-of-two number of bits for bitset size allows us to use bitwise
>> + * AND operations to calculate the modulo of a hash value. It's also a simple
>> + * way of avoiding the modulo bias effect.
>> + */
>> +static inline uint32
>> +mod_m(uint32 val, uint64 m)
>> +{
>> + Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
>> + Assert(((m - 1) & m) == 0);
>> +
>> + return val & (m - 1);
>> +}
>
> What's up with the two different widths here?

For a 512MB Bloom filter, m can be UINT_MAX + 1. This is referenced earlier on.

>> +#ifndef _BLOOMFILTER_H_
>> +#define _BLOOMFILTER_H_
>
> Names starting with an underscore followed by an uppercase letter are
> reserved. Yes, we have some already. No, we shouldn't introduce further ones.

Okay. Will fix.

>> From 71878742061500b969faf7a7cff3603d644c90ca Mon Sep 17 00:00:00 2001
>> From: Peter Geoghegan <pg(at)bowt(dot)ie>
>> Date: Tue, 2 May 2017 00:19:24 -0700
>> Subject: [PATCH v7 2/2] Add amcheck verification of indexes against heap.
>>
>> Add a new, optional capability to bt_index_check() and
>> bt_index_parent_check(): callers can check that each heap tuple that
>> ought to have an index entry does in fact have one. This happens at the
>> end of the existing verification checks.
>
> And here we get back to why I last year though this interface is
> bad. Now this really can't properly described as a pure index check
> anymore, and we need to add the same functionality to multiple
> functions.

If it makes you feel any better, I'm pretty much done with
bt_index_check() and bt_index_parent_check(). This patch will more
than likely be the last to revise their interface; there will be new
functions next year. That's why I thought it was okay last year.

>> +--
>> +-- bt_index_check()
>> +--
>> +DROP FUNCTION bt_index_check(regclass);
>> +CREATE FUNCTION bt_index_check(index regclass,
>> + heapallindexed boolean DEFAULT false)
>> +RETURNS VOID
>> +AS 'MODULE_PATHNAME', 'bt_index_check'
>> +LANGUAGE C STRICT PARALLEL RESTRICTED;
>
> This breaks functions et al referencing the existing function.

This sounds like a general argument against ever changing a function's
signature. It's not like we haven't done that a number of times in
extensions like pageinspect. Does it really matter?

> Also, I
> don't quite recall the rules, don't we have to drop the function from
> the extension first?

But...I did drop the function?

> Given the PG_NARGS() checks I don't understand why don't just create a
> separate two-argument SQL function above? If you rely on the default
> value you don't need it anyway?

I don't need it. But a user of amcheck might appreciate it.

>> + /*
>> + * * Heap contains unindexed/malformed tuples check *
>> + */
>
> I'd reorder this to "Check whether heap contains ...".

Okay.

>> + if (state->heapallindexed)
>> + {
>> + IndexInfo *indexinfo = BuildIndexInfo(state->rel);
>> + HeapScanDesc scan;
>> +
>> + /*
>> + * Create our own scan for IndexBuildHeapScan(), like a parallel index
>> + * build. We do things this way because it lets us use the MVCC
>> + * snapshot we acquired before index fingerprinting began in the
>> + * !readonly case.
>> + */
>
> I'd shorten the middle part out, so it's "IndexBuildHeapScan(), so we
> can register an MVCC snapshot acquired before..."

Okay.

>> + scan = heap_beginscan_strat(state->heaprel, /* relation */
>> + snapshot, /* snapshot */
>> + 0, /* number of keys */
>> + NULL, /* scan key */
>> + true, /* buffer access strategy OK */
>> + true); /* syncscan OK? */
>> +
>> + /*
>> + * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
>> + * behaves when only AccessShareLock held. This is really only needed
>> + * to prevent confusion within IndexBuildHeapScan() about how to
>> + * interpret the state we pass.
>> + */
>> + indexinfo->ii_Concurrent = !state->readonly;
>
> That's not very descriptive.

The point is that it doesn't expect an MVCC snapshot when we don't say
we're CIC. The assertions that I added to IndexBuildHeapScan() go
nuts, for example.

I'll change this to "This is needed so that IndexBuildHeapScan() knows
to expect an MVCC snapshot".

>> + /* Fingerprint leaf page tuples (those that point to the heap) */
>> + if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
>> + bloom_add_element(state->filter, (unsigned char *) itup,
>> + IndexTupleSize(itup));
>
> So, if we didn't use IndexBuildHeapScan(), we could also check whether
> dead entries at least have a corresponding item on the page, right? I'm
> not asking to change, mostly curious.

It's true that the dead entries ought to point to a valid heap
tuple/root heap-only tuple, since the LP_DEAD bit is just a hint.
However, you'd have to hold a buffer lock on the leaf page throughout,
since a write that would otherwise result in a page split is going to
recycling LP_DEAD items.

FWIW, the !ItemIdIsDead() thing isn't very important, because LP_DEAD
bits tend to get set quickly when the heap happens to be corrupt in a
way that makes corresponding heap tuples look dead when they shouldn't
be. But it's an opportunity to add less stuff to the Bloom filter,
which might make a small difference. Also, it might have some minor
educational value, for hackers that want to learn more about nbtree,
which remains a secondary goal.

>> +/*
>> + * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
>> + * all the entries that definitely should have been observed in leaf pages of
>> + * the target index (that is, all IndexTuples that were fingerprinted by our
>> + * Bloom filter). All heapallindexed checks occur here.
>
> The last sentence isn't entirely fully true ;), given that we check for
> the bloom insertion above. s/checks/verification/?

Not sure what you mean. Is your point that we could have an error from
within IndexBuildHeapScan() and friends, as opposed to from this
callback, such as the error that we saw during the "freeze the dead"
business [3]?

The bloom_add_element() calls cannot raise errors; they're just there
because they're needed to make the check in the callback
bt_tuple_present_callback() have a summarizing structure to work off
of.

> We should be able to get this into v11...

That's a relief. Thanks.

[1] https://postgr.es/m/CAH2-Wznm5ZOjS0_DJoWrcm9Us19gzbkm0aTKt5hHprvjHFVHpQ@mail.gmail.com
[2] https://hur.st/bloomfilter/
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8ecdc2ffe3da3a84d01e51c784ec3510157c893b
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-03-30 02:46:31 Re: Speedup of relation deletes during recovery
Previous Message Michael Paquier 2018-03-30 02:37:19 Re: Change RangeVarGetRelidExtended() to take flags argument?