Re: [HACKERS] A design for amcheck heapam verification

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] A design for amcheck heapam verification
Date: 2018-03-30 01:18:34
Message-ID: 20180330011834.4hnr7bhvid26zgpt@alap3.anarazel.de

Hi,

On 2018-03-26 20:20:57 -0700, Peter Geoghegan wrote:
> From ede1ba731dc818172a94adbb6331323c1f2b1170 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Thu, 24 Aug 2017 20:58:21 -0700
> Subject: [PATCH v7 1/2] Add Bloom filter data structure implementation.
>
> A Bloom filter is a space-efficient, probabilistic data structure that
> can be used to test set membership. Callers will sometimes incur false
> positives, but never false negatives. The rate of false positives is a
> function of the total number of elements and the amount of memory
> available for the Bloom filter.
>
> Two classic applications of Bloom filters are cache filtering, and data
> synchronization testing. Any user of Bloom filters must accept the
> possibility of false positives as a cost worth paying for the benefit in
> space efficiency.
>
> This commit adds a test harness extension module, test_bloomfilter. It
> can be used to get a sense of how the Bloom filter implementation
> performs under varying conditions.

Maybe add a short paragraph explaining what this'll be used for soon.
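
For anyone skimming along, the API ends up being used roughly like this
(a sketch from my reading of the patch; bloom_lacks_element() and
bloom_free() are what I assume the lookup and cleanup functions are
called):

bloom_filter *filter;

/* estimated final set size, work_mem budget in KB, caller-chosen seed */
filter = bloom_create(total_elems, bloom_work_mem, seed);

/* insertion phase: hash each element into the bitset */
bloom_add_element(filter, (unsigned char *) elem, elem_len);

/* lookup phase: false positives are possible, false negatives are not */
if (bloom_lacks_element(filter, (unsigned char *) elem, elem_len))
    elog(NOTICE, "element is definitely not in the set");

bloom_free(filter);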

> @@ -12,7 +12,7 @@ subdir = src/backend/lib
> top_builddir = ../../..
> include $(top_builddir)/src/Makefile.global
>
> -OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
> - knapsack.o pairingheap.o rbtree.o stringinfo.o
> +OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
> + ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o

*NOT* for this patch: I really wonder whether we should move towards a
style where there's only ever a single object per line. That would make
hunks like this easier to read and conflicts easier to resolve.

> --- /dev/null
> +++ b/src/backend/lib/bloomfilter.c
> @@ -0,0 +1,304 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bloomfilter.c
> + * Space-efficient set membership testing
> + *
> + * A Bloom filter is a probabilistic data structure that is used to test an
> + * element's membership of a set.

s/of/in/?

> False positives are possible, but false
> + * negatives are not; a test of membership of the set returns either "possibly
> + * in set" or "definitely not in set". This can be very space efficient when
> + * individual elements are larger than a few bytes, because elements are hashed
> + * in order to set bits in the Bloom filter bitset.

The second half of this paragraph isn't easy to understand.

> + * Elements can be added to the set, but not removed. The more elements that
> + * are added, the larger the probability of false positives. Caller must hint
> + * an estimated total size of the set when its Bloom filter is initialized.
> + * This is used to balance the use of memory against the final false positive
> + * rate.

s/its Bloom/the Bloom/?

> + * The implementation is well suited to data synchronization problems between
> + * unordered sets, especially where predictable performance is important and
> + * some false positives are acceptable.

I'm not finding "data synchronization" very descriptive. Makes me think
of multi-threaded races and such.

> +/*
> + * Create Bloom filter in caller's memory context. This should get a false
> + * positive rate of between 1% and 2% when bitset is not constrained by memory.

s/should/aims at/?

> + * total_elems is an estimate of the final size of the set. It ought to be
> + * approximately correct, but we can cope well with it being off by perhaps a
> + * factor of five or more. See "Bloom Filters in Probabilistic Verification"
> + * (Dillinger & Manolios, 2004) for details of why this is the case.

I'd simplify the language here. I'd replace "ought" with "should" at the
very least. Replace "we" with "the Bloom filter" or similar?

> + * bloom_work_mem is sized in KB, in line with the general work_mem convention.
> + * This determines the size of the underlying bitset (trivial bookkeeping space
> + * isn't counted). The bitset is always sized as a power-of-two number of
> + * bits, and the largest possible bitset is 512MB. The implementation rounds
> + * down as needed.

"as needed" should be expanded. Just say ~"Only the required amount of
memory is allocated"?

> +bloom_filter *
> +bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
> +{
> + bloom_filter *filter;
> + int bloom_power;
> + uint64 bitset_bytes;
> + uint64 bitset_bits;
> +
> + /*
> + * Aim for two bytes per element; this is sufficient to get a false
> + * positive rate below 1%, independent of the size of the bitset or total
> + * number of elements. Also, if rounding down the size of the bitset to
> + * the next lowest power of two turns out to be a significant drop, the
> + * false positive rate still won't exceed 2% in almost all cases.
> + */
> + bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
> + /* Minimum allowable size is 1MB */
> + bitset_bytes = Max(1024L * 1024L, bitset_bytes);

Some upcasting might be advisable, to avoid the danger of overflow?
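
Something like this, perhaps (untested, just to sketch what I mean):

/* do the arithmetic in 64 bit, lest bloom_work_mem * 1024 overflow */
bitset_bytes = Min((uint64) bloom_work_mem * UINT64CONST(1024),
                   (uint64) total_elems * 2);
/* Minimum allowable size is 1MB */
bitset_bytes = Max(UINT64CONST(1024) * 1024, bitset_bytes);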

> +/*
> + * Generate k hash values for element.
> + *
> + * Caller passes array, which is filled-in with k values determined by hashing
> + * caller's element.
> + *
> + * Only 2 real independent hash functions are actually used to support an
> + * interface of up to MAX_HASH_FUNCS hash functions; enhanced double hashing is
> + * used to make this work. The main reason we prefer enhanced double hashing
> + * to classic double hashing is that the latter has an issue with collisions
> + * when using power-of-two sized bitsets. See Dillinger & Manolios for full
> + * details.
> + */
> +static void
> +k_hashes(bloom_filter *filter, uint32 *hashes, unsigned char *elem, size_t len)
> +{
> + uint64 hash;
> + uint32 x, y;
> + uint64 m;
> + int i;
> +
> + /* Use 64-bit hashing to get two independent 32-bit hashes */
> + hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));

Hm. Is that smart given how some hash functions are defined? E.g. for
int8 the higher bits aren't really that independent for small values:

Datum
hashint8(PG_FUNCTION_ARGS)
{
/*
* The idea here is to produce a hash value compatible with the values
* produced by hashint4 and hashint2 for logically equal inputs; this is
* necessary to support cross-type hash joins across these input types.
* Since all three types are signed, we can xor the high half of the int8
* value if the sign is positive, or the complement of the high half when
* the sign is negative.
*/
int64 val = PG_GETARG_INT64(0);
uint32 lohalf = (uint32) val;
uint32 hihalf = (uint32) (val >> 32);

lohalf ^= (val >= 0) ? hihalf : ~hihalf;

return hash_uint32(lohalf);
}

Datum
hashint8extended(PG_FUNCTION_ARGS)
{
/* Same approach as hashint8 */
int64 val = PG_GETARG_INT64(0);
uint32 lohalf = (uint32) val;
uint32 hihalf = (uint32) (val >> 32);

lohalf ^= (val >= 0) ? hihalf : ~hihalf;

return hash_uint32_extended(lohalf, PG_GETARG_INT64(1));
}
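
To spell out the worry: if the upper 32 bits aren't independent of the
lower 32, the k probe positions end up correlated. The elided remainder
of k_hashes() presumably continues roughly like this (field names are my
guess), following Dillinger & Manolios' enhanced double hashing:

x = (uint32) hash;          /* lower half of the 64-bit hash */
y = (uint32) (hash >> 32);  /* upper half; the bits I'm worried about */
m = filter->bitset_bits;

x = mod_m(x, m);
y = mod_m(y, m);

/* enhanced double hashing: advance x by y, and y by the loop index */
hashes[0] = x;
for (i = 1; i < filter->k_hash_funcs; i++)
{
    x = mod_m(x + y, m);
    y = mod_m(y + i, m);
    hashes[i] = x;
}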

> +/*
> + * Calculate "val MOD m" inexpensively.
> + *
> + * Assumes that m (which is bitset size) is a power-of-two.
> + *
> + * Using a power-of-two number of bits for bitset size allows us to use bitwise
> + * AND operations to calculate the modulo of a hash value. It's also a simple
> + * way of avoiding the modulo bias effect.
> + */
> +static inline uint32
> +mod_m(uint32 val, uint64 m)
> +{
> + Assert(m <= PG_UINT32_MAX + UINT64CONST(1));
> + Assert(((m - 1) & m) == 0);
> +
> + return val & (m - 1);
> +}

What's up with the two different widths here?
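
My guess is that val is 32 bits because the hash values are, while m has
to be 64 bits because the largest allowed bitset is 512MB, i.e. 2^32
bits, which doesn't fit in uint32. If that's the reason, a comment would
help, e.g.:

/*
 * val is a 32-bit hash value, but m may be as large as 2^32 (a 512MB
 * bitset has 2^32 bits), which doesn't fit in uint32; hence the mixed
 * widths.
 */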

> @@ -0,0 +1,27 @@
> +/*-------------------------------------------------------------------------
> + *
> + * bloomfilter.h
> + * Space-efficient set membership testing
> + *
> + * Copyright (c) 2018, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + * src/include/lib/bloomfilter.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef _BLOOMFILTER_H_
> +#define _BLOOMFILTER_H_

Names starting with an underscore followed by an uppercase letter are
reserved. Yes, we have some already. No, we shouldn't introduce further ones.
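
I.e. just drop the underscores:

#ifndef BLOOMFILTER_H
#define BLOOMFILTER_H
...
#endif							/* BLOOMFILTER_H */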

> From 71878742061500b969faf7a7cff3603d644c90ca Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan <pg(at)bowt(dot)ie>
> Date: Tue, 2 May 2017 00:19:24 -0700
> Subject: [PATCH v7 2/2] Add amcheck verification of indexes against heap.
>
> Add a new, optional capability to bt_index_check() and
> bt_index_parent_check(): callers can check that each heap tuple that
> ought to have an index entry does in fact have one. This happens at the
> end of the existing verification checks.

And here we get back to why I thought last year that this interface is
bad. Now this really can't properly be described as a pure index check
anymore, and we need to add the same functionality to multiple
functions.

> +--
> +-- bt_index_check()
> +--
> +DROP FUNCTION bt_index_check(regclass);
> +CREATE FUNCTION bt_index_check(index regclass,
> + heapallindexed boolean DEFAULT false)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_check'
> +LANGUAGE C STRICT PARALLEL RESTRICTED;

This breaks functions etc. referencing the existing function. Also, I
don't quite recall the rules: don't we have to drop the function from
the extension first?

> /*
> - * bt_index_check(index regclass)
> + * bt_index_check(index regclass, heapallindexed boolean)
> *
> * Verify integrity of B-Tree index.
> *
> * Acquires AccessShareLock on heap & index relations. Does not consider
> - * invariants that exist between parent/child pages.
> + * invariants that exist between parent/child pages. Optionally verifies
> + * that heap does not contain any unindexed or incorrectly indexed tuples.
> */
> Datum
> bt_index_check(PG_FUNCTION_ARGS)
> {
> Oid indrelid = PG_GETARG_OID(0);
> + bool heapallindexed = false;
>
> - bt_index_check_internal(indrelid, false);
> + if (PG_NARGS() == 2)
> + heapallindexed = PG_GETARG_BOOL(1);
> +
> + bt_index_check_internal(indrelid, false, heapallindexed);
>
> PG_RETURN_VOID();
> }

Given the PG_NARGS() check, I don't understand why you don't just create
a separate two-argument SQL function above. If you rely on the default
value, you don't need the check anyway?
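
I.e. once the SQL-level DEFAULT is in place, the C function can simply
assume both arguments are present. A sketch (leaving aside for a moment
whether an old extension version's one-argument SQL definition might
still point at this symbol):

Datum
bt_index_check(PG_FUNCTION_ARGS)
{
    Oid         indrelid = PG_GETARG_OID(0);
    bool        heapallindexed = PG_GETARG_BOOL(1);

    bt_index_check_internal(indrelid, false, heapallindexed);

    PG_RETURN_VOID();
}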

> + /*
> + * * Heap contains unindexed/malformed tuples check *
> + */

I'd reorder this to "Check whether heap contains ...".

> + if (state->heapallindexed)
> + {
> + IndexInfo *indexinfo = BuildIndexInfo(state->rel);
> + HeapScanDesc scan;
> +
> + /*
> + * Create our own scan for IndexBuildHeapScan(), like a parallel index
> + * build. We do things this way because it lets us use the MVCC
> + * snapshot we acquired before index fingerprinting began in the
> + * !readonly case.
> + */

I'd shorten out the middle part, so it reads "IndexBuildHeapScan(), so
we can register an MVCC snapshot acquired before...".
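
I.e. something like:

/*
 * Create our own scan for IndexBuildHeapScan(), so we can register an
 * MVCC snapshot acquired before index fingerprinting began in the
 * !readonly case.
 */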

> + scan = heap_beginscan_strat(state->heaprel, /* relation */
> + snapshot, /* snapshot */
> + 0, /* number of keys */
> + NULL, /* scan key */
> + true, /* buffer access strategy OK */
> + true); /* syncscan OK? */
> +
> + /*
> + * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
> + * behaves when only AccessShareLock held. This is really only needed
> + * to prevent confusion within IndexBuildHeapScan() about how to
> + * interpret the state we pass.
> + */
> + indexinfo->ii_Concurrent = !state->readonly;

That's not very descriptive.

> + /* Fingerprint leaf page tuples (those that point to the heap) */
> + if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
> + bloom_add_element(state->filter, (unsigned char *) itup,
> + IndexTupleSize(itup));

So, if we didn't use IndexBuildHeapScan(), we could also check whether
dead entries at least have a corresponding item on the page, right? I'm
not asking you to change it, mostly curious.

> +/*
> + * Per-tuple callback from IndexBuildHeapScan, used to determine if index has
> + * all the entries that definitely should have been observed in leaf pages of
> + * the target index (that is, all IndexTuples that were fingerprinted by our
> + * Bloom filter). All heapallindexed checks occur here.

The last sentence isn't entirely true ;), given that the Bloom filter
insertions happen above, not here. s/checks/verification/?

We should be able to get this into v11...

Greetings,

Andres Freund
