Re: GIN data corruption bug(s) in 9.6devel

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN data corruption bug(s) in 9.6devel
Date: 2016-04-22 21:03:01
Message-ID: CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 21, 2016 at 11:00 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
>> >>Added, see attached patch (based on v3.1)
>> >
>> >With this applied, I am getting a couple errors I have not seen before
>> >after extensive crash recovery testing:
>> >ERROR: attempted to delete invisible tuple
>> >ERROR: unexpected chunk number 1 (expected 2) for toast value
>> >100338365 in pg_toast_16425
>> Huh, seems, it's not related to GIN at all... Indexes don't play with toast
>> machinery. The single place where this error can occur is a heap_delete() -
>> deleting already deleted tuple.
>
> Like you, I would not expect gin_alone_cleanup-4.patch to cause such an error.
> I get the impression Jeff has a test case that he had run in many iterations
> against the unpatched baseline. I also get the impression that a similar or
> smaller number of its iterations against gin_alone_cleanup-4.patch triggered
> these two errors (once apiece, or multiple times?). Jeff, is that right?

Because the unpatched baseline suffers from the bug which was the
original topic of this thread, I haven't been able to test against the
original baseline. It would fail from that other bug before it ran
long enough to hit these ones. Both errors start within a few minutes
of each other, but do not appear to be related other than that. Once
they start happening, they occur repeatedly.

> Could you describe the test case in sufficient detail for Teodor to reproduce
> your results?

I spawn a swarm of processes to update a counter in a randomly chosen
row, selected via the gin index. They do this as fast as possible
until the server intentionally crashes. When it recovers, they
compare notes and see if the results are consistent.

But in this case, the problem is not inconsistent results, but rather
errors during the updating stage.

The patch introduces a mechanism to crash the server, a mechanism to
fast-forward the XID, and some additional logging that I sometimes
find useful (which I haven't been using in this case, but don't want
to rip it out)

The perl script implements the core of the test harness.

The shell script sets up the server (using hard coded paths for the
data and the binaries, so will need to be changed), and then calls the
perl script in a loop.

Running on an 8 core system, I've never seen it have a problem in less
than 9 hours of run time.

This produces copious amounts of logging to stdout. This is how I
look through the log files to find this particular problem:

sh do.sh >& do.err &

tail -f do.err | fgrep ERROR

The more I think about it, the more I think gin is just an innocent
bystander, for which I just happen to have a particularly demanding
test. I think something about snapshots and wrap-around may be
broken.

Cheers,

Jeff

Attachment Content-Type Size
count.pl application/octet-stream 8.4 KB
crash_REL9_6.patch application/octet-stream 13.0 KB
do.sh application/x-sh 4.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-04-22 21:19:14 Re: EXPLAIN VERBOSE with parallel Aggregate
Previous Message Tom Lane 2016-04-22 20:55:47 Re: [COMMITTERS] pgsql: Inline initial comparisons in TestForOldSnapshot()