atomic pin/unpin causing errors

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: atomic pin/unpin causing errors
Date: 2016-04-29 17:38:55
Message-ID: CAMkU=1w85Dqt766AUrCnyqCXfZ+rsk1witAc_=v5+Pce93Sftw@mail.gmail.com
Lists: pgsql-hackers

I've bisected the errors I was seeing, discussed in
http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com

It looks like they first appear in:

commit 48354581a49c30f5757c203415aa8412d85b0f70
Author: Andres Freund <andres(at)anarazel(dot)de>
Date: Sun Apr 10 20:12:32 2016 -0700

Allow Pin/UnpinBuffer to operate in a lockfree manner.

I get the errors:

ERROR: attempted to delete invisible tuple
STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2

And also:

ERROR: unexpected chunk number 1 (expected 2) for toast value 85223889 in pg_toast_16424
STATEMENT: update foo set count=count+1 where text_array @> $1

Once these errors start occurring, they happen often. Usually the
"attempted to delete invisible tuple" happens first.

These errors show up after about 9 hours of run time. The timing is
predictable enough that I don't think it is a purely stochastic race
condition. It seems like some counter variable is overflowing. But
it is not the ShmemVariableCache->nextXid counter, as I previously
speculated. This test does not advance it fast enough for it to
wrap around within 9 hours of run time. But I am at a loss as to what
other variable it might be. Since the system goes through a crash and
recovery every few seconds, any backend-local counters or
shared-memory counters would get reset upon recovery. Right?
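
For a rough sense of scale, the wraparound reasoning above can be checked with a back-of-the-envelope calculation. This is only a sketch under assumptions that are not part of the report: the counter is taken to be an unsigned 32-bit or 64-bit integer starting near zero, the run time is taken to be exactly 9 hours, and the name required_rate is purely illustrative.

```python
# Sketch: how fast must a counter advance to wrap within the observed run time?
# Assumptions (not from the report): the counter is an unsigned 32-bit or
# 64-bit integer starting near zero, and the run time is exactly 9 hours.

RUN_TIME_SECONDS = 9 * 60 * 60  # 9 hours expressed in seconds


def required_rate(counter_bits: int, seconds: int = RUN_TIME_SECONDS) -> float:
    """Return the increments per second needed for the counter to wrap."""
    return (2 ** counter_bits) / seconds


if __name__ == "__main__":
    for bits in (32, 64):
        rate = required_rate(bits)
        print(f"{bits}-bit counter: {rate:,.0f} increments/sec to wrap")
```

Under these assumptions a 32-bit counter would have to advance roughly 132,000 times per second, sustained for the whole run, to overflow in 9 hours, so any candidate counter can be ruled in or out by comparing its observed rate of advance against that figure.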

I think the invisible tuple referred to might be a tuple in the toast
table, not in the parent table.

I don't see the problem with a cassert-enabled build, probably because it
is just too slow to ever reach the point where the problem occurs.

Any suggestions about where or how to look? I don't know if the
"attempted to delete invisible tuple" is the bug itself, or is just
tripping over corruption left behind by someone else.

(This was all run using Teodor's test-enabling patch
gin_alone_cleanup-4.patch, so as not to change horses in midstream.
Now that a version of that patch has been committed, I will try to
repeat this in HEAD.)

Cheers,

Jeff
