Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [BUG] failed assertion in EnsurePortalSnapshotExists()
Date: 2021-09-28 16:50:11
Message-ID: 2850730.1632847811@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com> writes:
> Does it make sense (as it is currently) to set the ActiveSnapshot to
> NULL and not ensuring the same is done for ActivePortal->portalSnapshot?

I think that patch is just a kluge :-(

After tracing through this I've figured out what is happening, and
why you need to put the exception block inside a loop to make it
happen. The first iteration goes fine, and we end up with an empty
ActiveSnapshot stack (and no portalSnapshots) because that's how
the COMMIT leaves it. In the second iteration, we try to
re-establish the portal snapshot, but at the point where
EnsurePortalSnapshotExists is called (from the first INSERT
command) we are already inside a subtransaction thanks to the
plpgsql exception block. This means that the stacked ActiveSnapshot
has as_level = 2, although the Portal owning it belongs to the
outer transaction. So at the next exception, AtSubAbort_Snapshot
zaps the stacked ActiveSnapshot, but the Portal stays alive and
now it has a dangling portalSnapshot pointer.

Basically there seem to be two ways to fix this, both requiring
API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):

1. Provide a variant of PushActiveSnapshot that allows the caller
to specify the as_level to use, and then have
EnsurePortalSnapshotExists, as well as other places that create
Portal-associated snapshots, use this with as_level equal to the
Portal's createSubid. The main hazard I see here is that this could
theoretically allow the ActiveSnapshot stack to end up with
out-of-order as_level values, causing issues. For the moment we
could probably just add assertions to check that the passed as_level
is >= next level, or maybe even that this variant is only called with
empty ActiveSnapshot stack.

2. Provide a way for AtSubAbort_Portals to detect whether a
portalSnapshot pointer points to a snap that's going to be
deleted by AtSubAbort_Snapshot, and then just have it clear any
portalSnapshots that are about to become dangling. (This'd amount
to accepting the possibility that portalSnapshot is of a different
subxact level from the portal, and dealing with the situation.)

I initially thought #2 would be the way to go, but it turns out
to be a bit messy since what we have is a Snapshot pointer not an
ActiveSnapshotElt pointer. We'd have to do something like search the
ActiveSnapshot stack looking for pointer equality to the caller's
Snapshot pointer, which seems fragile --- do we assume as_snap is
unique for any other purpose?

That being the case, I'm now leaning to #1. Thoughts?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-09-28 17:27:11 Re: Fixing WAL instability in various TAP tests
Previous Message Jaime Casanova 2021-09-28 15:58:53 Re: FETCH FIRST clause PERCENT option