Re: [HACKERS] A design for amcheck heapam verification

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] A design for amcheck heapam verification
Date: 2017-12-07 19:37:23
Message-ID: CAH2-Wznm5ZOjS0_DJoWrcm9Us19gzbkm0aTKt5hHprvjHFVHpQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 28, 2017 at 9:54 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Nov 28, 2017 at 9:50 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> Would that address your concern? There would be an SQL interface, but
>>> it would be trivial.
>>
>> That's exactly what I think you should do, and mentioned so upthread.
>> A SQL interface can also show a good example of how developers can use
>> this API.

Attach revision, v5, adds a new test harness -- test_bloomfilter.

This can be used to experimentally verify that the meets the well
known "1% false positive rate with 9.6 bits per element" standard. It
manages to do exactly that:

postgres=# set client_min_messages = 'debug1';
SET
postgres=# SELECT test_bloomfilter(power => 23, nelements => 873813,
seed => -1, tests => 3);
DEBUG: beginning test #1...
DEBUG: bloom_work_mem (KB): 1024
DEBUG: false positives: 8630 (rate: 0.009876, proportion bits set:
0.517625, seed: 1373191603)
DEBUG: beginning test #2...
DEBUG: bloom_work_mem (KB): 1024
DEBUG: false positives: 8623 (rate: 0.009868, proportion bits set:
0.517623, seed: 406665822)
DEBUG: beginning test #3...
DEBUG: bloom_work_mem (KB): 1024
WARNING: false positives: 8840 (rate: 0.010117, proportion bits set:
0.517748, seed: 398116374)
test_bloomfilter
------------------

(1 row)

Here, we repeat the same test 3 times, varying only the seed value
used for each run.

The last message is a WARNING because we exceed the 1% threshold
(hard-coded into test_bloomfilter.c), though only by a tiny margin,
due only to random variations in seed value. We round up to 10 bits
per element for the regression tests. That's where the *actual*
"nelements" argument comes from within the tests, so pg_regress tests
should never see the WARNING (if they do, that counts as a failure).

I've experimentally observed that we get the 1% false positive rate
with any possible bitset when "nelements" works out at 9.6 bitset bits
per element. Inter-run variation is tiny. With 50 tests, I didn't
observe these same Bloom filter parameters produce a false positive
rate that came near to 1.1%. 1.01% or 1.02% was about as bad as it
got.

There is a fairly extensive README, which I hope will clear up the
theory behind the bloomfilter.c strategy on bitset size and false
positives. Also, there was a regression that I had to fix in
bloomfilter.c, in seeding. It didn't reliably cause variation in the
false positives. And, there was bitrot with the documentation that I
fixed up.

--
Peter Geoghegan

Attachment Content-Type Size
0001-Add-Bloom-filter-data-structure-implementation.patch text/x-patch 26.4 KB
0002-Add-amcheck-verification-of-indexes-against-heap.patch text/x-patch 38.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joshua D. Drake 2017-12-07 19:38:51 Re: Logical replication without a Primary Key
Previous Message Peter Eisentraut 2017-12-07 19:35:04 Re: Transform for pl/perl