Undiagnosed bug in Bloom index

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Undiagnosed bug in Bloom index
Date: 2016-08-13 18:05:17
Message-ID: CAMkU=1xEUuBphDwDmB1WjN4+td4kpnEniFaTBxnk1xzHCw8_OQ@mail.gmail.com
Lists: pgsql-hackers

I am getting corrupted Bloom indexes in which a tuple in the table
heap is not in the index.

I see it as early as commit a9284849b48b, with commit e13ac5586c49c
cherry-picked onto it. I don't see it before a9284849b48b because the
test case segfaults before anything interesting can happen. I think
this is an ab initio bug, either in the bloom contrib module or in the
core index AM. I see it as recently as 371b572, which is as new as I
have tested.

The symptom is that an update which should update exactly one row
instead updates zero rows.

It takes 5 to 16 hours to reproduce when run as 8 clients on 8 cores.
I suspect it is some kind of race condition, and testing with more
clients on more cores would make it happen faster. If you inject
crash/recovery cycles into the system, it seems to happen sooner. But
crash/recovery cycles are not necessary.

If you use the attached do_nocrash.sh script, the error will generate
a message like:

child abnormal exit update did not update 1 row: key 6838 updated 0E0
at count.pl line 189.\n at count.pl line 197.

(And I've added code so that once this is detected, the script will
soon terminate)
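
For anyone who would rather not read count.pl, the check that fires is
roughly the following. This is only a minimal Python sketch of the idea,
not the Perl from the attachment: the function name, the exception class,
and the use of a DB-API 2.0 cursor are assumptions made for illustration.
(The "0E0" in the message above is how Perl's DBI reports zero affected
rows.)

```python
class RowCountError(Exception):
    """Raised when a statement touches an unexpected number of rows."""


def execute_expecting_one_row(cursor, sql, params, key):
    """Run sql and insist that exactly one row was affected.

    cursor is any DB-API 2.0 cursor (an assumption; the real harness
    uses Perl/DBI). key is used only to build the error message.
    """
    cursor.execute(sql, params)
    if cursor.rowcount != 1:
        # Mirrors the wording of the harness error, with the count
        # reported as a plain integer rather than DBI's "0E0".
        raise RowCountError(
            "update did not update 1 row: key %s updated %s"
            % (key, cursor.rowcount)
        )
    return cursor.rowcount
```

The point of the design is that a silent zero-row update is treated as
a hard failure instead of being ignored.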

If you want to run do_nocrash.sh, change the first few lines to
hardcode the correct path for the binaries and the temp data directory
(which will be ruthlessly deleted). It will run on an unpatched
server, since crash injection is turned off.

If you want to make it fork more clients, change the 8 in 'perl
count.pl 8 0|| on_error;'

I have preserved a large tarball (215M) of a corrupt data directory.
It was produced with the a928484 build with e13ac5586 cherry-picked,
and is at https://drive.google.com/open?id=0Bzqrh1SO9FcEci1FQTkwZW9ZU1U.
Alternatively, if you can tell me how to look at the index myself, I
can do that (pageinspect doesn't offer much for Bloom).

With that tarball, the first query, which uses the index, returns
nothing, while the second, which forces a seq scan, returns a row:

select * from foo where bloom = md5('6838');

select * from foo where bloom||'' = md5('6838');
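
The same comparison can be scripted to sweep over many keys. The
following is a minimal Python sketch under stated assumptions: the
helper name is hypothetical, the cursor is any DB-API 2.0 cursor whose
driver uses the "%s" parameter style, and it assumes the planner
actually picks the Bloom index for the first query (worth confirming
with EXPLAIN). Only the two query texts come from this message.

```python
# The two query forms from above, with the key passed as a parameter.
INDEXED_QUERY = "select * from foo where bloom = md5(%s)"
SEQSCAN_QUERY = "select * from foo where bloom||'' = md5(%s)"


def missing_from_index(cursor, key):
    """Return True if the seq-scan form finds rows the indexable form misses."""
    cursor.execute(INDEXED_QUERY, (str(key),))
    via_index = cursor.fetchall()
    # Appending '' to the column keeps the index from being used.
    cursor.execute(SEQSCAN_QUERY, (str(key),))
    via_heap = cursor.fetchall()
    return len(via_heap) > len(via_index)
```

Calling missing_from_index(cur, 6838) against the preserved data
directory is the case described above.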

The machinery posted here is probably much more elaborate than
necessary to detect the problem. You could probably detect it with
pgbench -N, except that that doesn't check the results to make sure
the expected number of rows were actually selected/updated.

Cheers,

Jeff

Attachment      Content-Type                Size
count.pl        application/octet-stream    8.5 KB
do_nocrash.sh   application/x-sh            5.2 KB
