Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [BUG] failed assertion in EnsurePortalSnapshotExists()
Date: 2021-09-29 09:54:40
Message-ID: 71af37bc-fc25-ed86-acf2-176d153ffb9b@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 9/28/21 6:50 PM, Tom Lane wrote:
> "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com> writes:
>> Does it make sense (as it is currently) to set the ActiveSnapshot to
>> NULL and not ensuring the same is done for ActivePortal->portalSnapshot?
> I think that patch is just a kluge :-(
right, sounded too simple to be "true".
>
> After tracing through this I've figured out what is happening, and
> why you need to put the exception block inside a loop to make it
> happen. The first iteration goes fine, and we end up with an empty
> ActiveSnapshot stack (and no portalSnapshots) because that's how
> the COMMIT leaves it. In the second iteration, we try to
> re-establish the portal snapshot, but at the point where
> EnsurePortalSnapshotExists is called (from the first INSERT
> command) we are already inside a subtransaction thanks to the
> plpgsql exception block. This means that the stacked ActiveSnapshot
> has as_level = 2, although the Portal owning it belongs to the
> outer transaction. So at the next exception, AtSubAbort_Snapshot
> zaps the stacked ActiveSnapshot, but the Portal stays alive and
> now it has a dangling portalSnapshot pointer.

Thanks for the explanation!

> Basically there seem to be two ways to fix this, both requiring
> API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):
>
> 1. Provide a variant of PushActiveSnapshot that allows the caller
> to specify the as_level to use, and then have
> EnsurePortalSnapshotExists, as well as other places that create
> Portal-associated snapshots, use this with as_level equal to the
> Portal's createSubid. The main hazard I see here is that this could
> theoretically allow the ActiveSnapshot stack to end up with
> out-of-order as_level values, causing issues. For the moment we
> could probably just add assertions to check that the passed as_level
> is >= next level, or maybe even that this variant is only called with
> empty ActiveSnapshot stack.
I think we may get a non empty ActiveSnapshot stack with prepared
statements, so tempted to do the assertion on the levels.
>
> 2. Provide a way for AtSubAbort_Portals to detect whether a
> portalSnapshot pointer points to a snap that's going to be
> deleted by AtSubAbort_Snapshot, and then just have it clear any
> portalSnapshots that are about to become dangling. (This'd amount
> to accepting the possibility that portalSnapshot is of a different
> subxact level from the portal, and dealing with the situation.)
>
> I initially thought #2 would be the way to go, but it turns out
> to be a bit messy since what we have is a Snapshot pointer not an
> ActiveSnapshotElt pointer. We'd have to do something like search the
> ActiveSnapshot stack looking for pointer equality to the caller's
> Snapshot pointer, which seems fragile --- do we assume as_snap is
> unique for any other purpose?
>
> That being the case, I'm now leaning to #1. Thoughts?

I'm also inclined to #1.

Please find attached a patch proposal for #1 that also adds a new test.

Thanks

Bertrand

Attachment Content-Type Size
v2-0001-EnsurePortalSnapshotExists-failed-assertion.patch text/plain 6.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Juan José Santamaría Flecha 2021-09-29 10:26:25 Re: stat() vs ERROR_DELETE_PENDING, round N + 1
Previous Message Amit Kapila 2021-09-29 09:13:56 Re: Added schema level support for publication.