Re: SKIP LOCKED assert triggered

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>
Cc: "Bossart, Nathan" <bossartn(at)amazon(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SKIP LOCKED assert triggered
Date: 2022-01-03 22:27:42
Message-ID: 202201032227.wxtyzzbzt5xj@alvherre.pgsql
Lists: pgsql-hackers

On 2021-Dec-01, Simon Riggs wrote:

> On Wed, 1 Dec 2021 at 14:33, Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
> >
> > On 11/12/21, 8:56 AM, "Simon Riggs" <simon(dot)riggs(at)enterprisedb(dot)com> wrote:
> > > The combination of these two statements in a transaction hits an
> > > Assert in heapam.c at line 4770 on REL_14_STABLE
> >
> > I've been unable to reproduce this. Do you have any tips for how to
> > do so? Does there need to be some sort of concurrent workload?
>
> That path is only ever taken when there are multiple sessions, and as
> I said, pgbench finds this reliably. I guess I didn't say "use -c 2"

Simon had sent me the pgbench scripts earlier, so I attach them here.
I don't actually get a crash with -c2 or -c3, but I do get almost
immediate crashes with -c4 and above. If I run it under "rr", it
doesn't occur either. I suspect the rr framework kills concurrency in
some way that hides the problem. I didn't find a way to reproduce it
with isolationtester. (If somebody wants to play with a debugger, I
find that it's much easier to reproduce by adding a short sleep after
the UpdateXmaxHintBits() call in line 4735; but that sleep occurs in a
session *other* than the one that dies. And under rr I still don't see
a crash with a sleep there; in fact the sleep doesn't seem to occur at
all, which is weird.)

The patch does fix the crasher under pgbench, and I think it makes sense
that you can get WouldBlock and yet have the tuple marked with
XMAX_INVALID: if transaction A is writing the tuple, and transaction B
is acquiring the tuple lock, then transaction C also tries to acquire
the tuple lock but that returns nay (because of B), then transaction A
completes, then transaction B could set the XMAX_INVALID flag in time
for C to trip the assertion on its way out. So patching the assertion to
allow the case is correct.
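
To make the ordering concrete, here is a toy replay of that interleaving in Python. It is only a sketch of the reasoning above, not the heapam.c code: the Tuple class, the flag value, the string "WouldBlock" and both assertion shapes are assumptions made for illustration, and the "patched" check simply encodes "allow XMAX_INVALID when the result is WouldBlock".

```python
from dataclasses import dataclass
from typing import Optional

XMAX_INVALID = 0x1  # arbitrary bit for this toy model, not the real flag value


@dataclass
class Tuple:
    infomask: int = 0
    writer: Optional[str] = None       # transaction currently writing the tuple
    lock_holder: Optional[str] = None  # transaction busy acquiring the tuple lock


def old_assert_holds(result: str, tup: Tuple) -> bool:
    # Hypothetical original shape: on failure, XMAX_INVALID must never be set.
    return not (tup.infomask & XMAX_INVALID)


def patched_assert_holds(result: str, tup: Tuple) -> bool:
    # Hypothetical relaxed shape: WouldBlock may coexist with XMAX_INVALID.
    return result == "WouldBlock" or not (tup.infomask & XMAX_INVALID)


def replay():
    tup = Tuple()
    tup.writer = "A"        # A is writing the tuple
    tup.lock_holder = "B"   # B is acquiring the tuple lock
    # C tries to lock with skip-locked semantics; B is in the way, so C gives up.
    c_result = "WouldBlock" if tup.lock_holder is not None else "Ok"
    tup.writer = None                 # A completes
    tup.infomask |= XMAX_INVALID      # B sets the flag before C reaches its exit
    # C now runs its exit-path assertion against the modified tuple.
    return c_result, old_assert_holds(c_result, tup), patched_assert_holds(c_result, tup)


if __name__ == "__main__":
    print(replay())
```

In this model the old check fails and the relaxed one passes, purely because B's flag update lands between C's failed lock attempt and C's assertion. It says nothing about how often the window is hit in practice.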

What I don't understand is why this hasn't been reported already: this
bug is pretty old. My only explanation is that nobody runs sufficiently
concurrent load with SKIP LOCKED in assert-enabled builds.

[1] https://www.postgresql.org/message-id/flat/CADLWmXUvd5Z%2BcFczi6Zj1WcTrXzipgP-wj0pZOWSaRUy%3DF0omQ%40mail.gmail.com

--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/

Attachment Content-Type Size
simon_setup.sql application/sql 498 bytes
simon_bench.sql application/sql 24 bytes
