Re: atomic pin/unpin causing errors

From: Andres Freund <andres(at)anarazel(dot)de>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: atomic pin/unpin causing errors
Date: 2016-04-30 00:10:55
Message-ID: 20160430001055.nc2rdgw3uqkckd4j@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xQEhC0Ok4d+tkjFQ1nvUhO37PYRKhJP6Q8oxifMx7OwA@mail.gmail.com
>
> It looks like they first appear in:
>
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund <andres(at)anarazel(dot)de>
> Date: Sun Apr 10 20:12:32 2016 -0700
>
> Allow Pin/UnpinBuffer to operate in a lockfree manner.
>
>
> I get the errors:
>
> ERROR: attempted to delete invisible tuple
> STATEMENT: update foo set count=count+1,text_array=$1 where text_array @> $2
>
> And also:
>
> ERROR: unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT: update foo set count=count+1 where text_array @> $1
>
> Once these errors start occurring, they happen often. Usually the
> "attempted to delete invisible tuple" happens first.

That seems to implicate clog/vacuuming or something along those lines.

> These errors show up after about 9 hours of run time. The timing is
> predictable enough that I don't think it is a purely stochastic race
> condition.

Hm. I've a bit of a hard time believing that such a timing could be
caused by the above patch. How confident are you that it's that patch, and
not just changed timing due to performance changes? And you definitely can
only reproduce the problem with the regular crash cycles?

> It seems like some counter variable is overflowing. But
> it is not the ShmemVariableCache->nextXid counter, as I previously
> speculated. This test does not advance that fast enough for it to
> wrap around within 9 hours of run time. But I am at a loss of what
> other variable it might be. Since the system goes through a crash and
> recovery every few seconds, any backend-local counters or
> shared-memory counters would get reset upon recovery. Right?

A lot of those counters will be re-set based on WAL contents. So if
they're corrupted once, several of them are prone to continue to be
corrupted.
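
As a rough sanity check on the overflow theory, here is a minimal sketch of the rate at which a counter would have to advance to wrap within the roughly 9 hours mentioned above. The counter widths are assumptions picked for illustration, not a claim about which variable (if any) is involved, and `required_rate` is a hypothetical helper, not anything from the PostgreSQL tree.

```python
# Back-of-the-envelope: how fast must an unsigned counter advance
# to wrap around within the reported ~9 hours of run time?
RUN_SECONDS = 9 * 60 * 60  # "about 9 hours" from the report


def required_rate(bits: int, seconds: int = RUN_SECONDS) -> float:
    """Increments per second needed for a `bits`-wide counter to wrap."""
    return (2 ** bits) / seconds


# Candidate widths are assumptions for illustration only.
for bits in (16, 31, 32):
    print(f"{bits}-bit counter: {required_rate(bits):,.1f} increments/s")
```

A 32-bit counter would need well over 100,000 increments per second to wrap in that window, whereas a 16-bit one would need only about two per second, so the width of any suspect counter matters a lot here.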

Greetings,

Andres Freund
