Re: GIN data corruption bug(s) in 9.6devel

From: Noah Misch <noah(at)leadboat(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN data corruption bug(s) in 9.6devel
Date: 2016-04-22 06:00:46
Message-ID: 20160422060046.GC2042217@tornado.leadboat.com
Lists: pgsql-hackers

On Mon, Apr 18, 2016 at 05:48:17PM +0300, Teodor Sigaev wrote:
> >>Added, see attached patch (based on v3.1)
> >
> >With this applied, I am getting a couple errors I have not seen before
> >after extensive crash recovery testing:
> >ERROR: attempted to delete invisible tuple
> >ERROR: unexpected chunk number 1 (expected 2) for toast value
> >100338365 in pg_toast_16425
> Huh, it seems this is not related to GIN at all... Indexes don't interact
> with the toast machinery. The only place where this error can occur is
> heap_delete(), when deleting an already-deleted tuple.

Like you, I would not expect gin_alone_cleanup-4.patch to cause such an error.
I get the impression Jeff has a test case that he had run in many iterations
against the unpatched baseline. I also get the impression that a similar or
smaller number of its iterations against gin_alone_cleanup-4.patch triggered
these two errors (once apiece, or multiple times?). Jeff, is that right? If
so, until we determine the cause, we should assume the cause arrived in
gin_alone_cleanup-4.patch. An error in pointer arithmetic or locking might
corrupt an unrelated buffer, leading to this symptom.

> >I've restarted the test harness with intentional crashes turned off,
> >to see if the problems are related to crash recovery or are more
> >generic than that.
> >
> >I've never seen these particular problems before, so don't have much
> >insight into what might be going on or how to debug it.

Could you describe the test case in sufficient detail for Teodor to reproduce
your results?

> Check my reasoning: In version 4 I added remembering of the tail of the
> pending list in the blknoFinish variable. When we read the page that was the
> tail at cleanup start, we set the cleanupFinish variable, and after cleaning
> that page we stop further cleanup. Any insert made during cleanup will be
> placed after blknoFinish (corner case: in that page), so vacuum should not
> miss tuples marked as deleted.

Would any hacker volunteer to review Teodor's reasoning here?
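
To give a reviewer something concrete to reason about, here is a rough toy
model of the stopping rule as I read Teodor's description. It is not code from
gin_alone_cleanup-4.patch. Only the names blknoFinish and cleanupFinish come
from his message; PendingList, pending_append(), pending_cleanup(), and the
concurrent_inserts parameter are hypothetical, and the model ignores locking,
page deletion, and the corner case of an insert landing in the tail page
itself.

```c
#include <assert.h>
#include <stdbool.h>

#define INVALID_BLOCK (-1)
#define MAX_PAGES 16

/* Toy stand-in for the pending list: page i links to next[i]. */
typedef struct PendingList
{
    int  next[MAX_PAGES];     /* following page, or INVALID_BLOCK */
    bool cleaned[MAX_PAGES];  /* has cleanup processed this page? */
    int  head;
    int  tail;
    int  npages;
} PendingList;

static void
pending_init(PendingList *pl)
{
    pl->head = INVALID_BLOCK;
    pl->tail = INVALID_BLOCK;
    pl->npages = 0;
}

/* Append a new page after the current tail; returns its block number. */
static int
pending_append(PendingList *pl)
{
    int blkno;

    assert(pl->npages < MAX_PAGES);
    blkno = pl->npages++;
    pl->next[blkno] = INVALID_BLOCK;
    pl->cleaned[blkno] = false;
    if (pl->tail == INVALID_BLOCK)
        pl->head = blkno;
    else
        pl->next[pl->tail] = blkno;
    pl->tail = blkno;
    return blkno;
}

/*
 * Clean pages from the head through the page that was the tail when cleanup
 * started.  To mimic inserts arriving during cleanup, concurrent_inserts
 * pages are appended right after the first page is cleaned.  Returns the
 * number of pages cleaned.
 */
static int
pending_cleanup(PendingList *pl, int concurrent_inserts)
{
    int  blknoFinish = pl->tail;   /* tail remembered at cleanup start */
    bool cleanupFinish = false;
    int  blkno = pl->head;
    int  ncleaned = 0;

    while (blkno != INVALID_BLOCK)
    {
        if (blkno == blknoFinish)
            cleanupFinish = true;  /* this is the last page to clean */

        pl->cleaned[blkno] = true;
        ncleaned++;

        if (ncleaned == 1)
        {
            for (int i = 0; i < concurrent_inserts; i++)
                pending_append(pl);
        }

        if (cleanupFinish)
            break;                 /* pages added later are left alone */
        blkno = pl->next[blkno];
    }
    return ncleaned;
}
```

In this model, a list of three pages that gains two more during cleanup has
exactly the original three cleaned, and the two new pages stay pending for the
next cleanup. Whether the real patch preserves that property under concurrent
access is the part that needs review.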

Thanks,
nm
