Re: abort-time portal cleanup

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: abort-time portal cleanup
Date: 2019-09-24 20:45:10
Message-ID: 20190924204510.zy72sboc7yy3hmy3@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-09-12 16:42:46 -0400, Robert Haas wrote:
> The trouble starts with the header comment for AtAbort_Portals, which
> states that "At this point we run the cleanup hook if present, but we
> can't release the portal's memory until the cleanup call." At the time
> this logic was introduced in commit
> de28dc9a04c4df5d711815b7a518501b43535a26 (2003-05-02),
> AtAbort_Portals() affected all non-held portals without caring whether
> they were active or not, and, UserAbortTransactionBlock() called
> AbortTransaction() directly, so typing "ROLLBACK;" would cause
> AtAbort_Portals() to be reached from within PortalRun(). Even if
> PortalRun() managed to return without crashing, the caller would next
> try to call PortalDrop() on what was now an invalid pointer. However,
> commit 8f9f1986034a2273e09ad10671e10d1adda21d1f (2004-09-16) changed
> things so that UserAbortEndTransaction() just sets things up so that
> the subsequent call to CommitTransactionCommand() would call
> AbortTransaction() instead of trying to do it right away, and that
> doesn't happen until after we're done with the portal. As far as I
> can see, that change made this comment mostly false, but the comment
> has nevertheless managed to survive for another ~15 years. I think we
> can, and in fact should, just drop the portal right here.

Nice digging.

> As far as actually making that work, there are a few wrinkles. The
> first is that we might be in the middle of FATAL. In that case, unlike
> the ROLLBACK case, a call to PortalRun() is still on the stack, but
> we'll exit the process rather than returning, so the fact that we've
> created a dangling pointer for the caller won't matter. However, as
> shown by commit ad9a274778d2d88c46b90309212b92ee7fdf9afe (2018-02-01)
> and the report that led up to it at
> https://www.postgresql.org/message-id/20180128034531.h6o4w3727ifof3jy%40alap3.anarazel.de
> it's not a good idea to try to clean up the portal in that case,
> because we might've already started shutting down critical systems.
> It seems not only risky but also unnecessary: our process-local state
> is about to go away, and the executor shouldn't need to clean up any
> shared memory state that won't also get cleaned up by some other
> mechanism. So, it seems to me that if we reach AtAbort_Portals()
> during FATAL processing, we should either (1) do nothing at all and
> just return or (2) forget about all the existing portals without
> cleaning them up and then return. The second option seems a little
> safer to me, because it guarantees that if we somehow reach code that
> might try to look up a portal later, it won't find anything. But I
> think it's arguable.

Hm. Doing that cleanup requires digging through all the portals etc. I'd
rather rely on less state being correct than more during FATAL
processing.

> The second wrinkle is that there might be an active portal. Apart
> from the FATAL case already mentioned, I think the only way this can
> happen is some C code that calls purposefully calls AbortTransaction()
> in the middle of executing a command. It can't be an ERROR, because
> then the portal would be marked as failed before we get here, and it
> can't be an explicit ROLLBACK, because as noted above, that case was
> changed 15 years ago. It's got to be some other case where C code
> calls AbortTransaction() voluntarily in the middle of a statement. For
> over a decade, there were no cases at all of this type, but the code
> in this function catered to hypothetical cases by marking the portal
> failed. By 2016, Noah had figured out that this was bogus, and that
> any future cases would likely require different handling, but Tom and
> I shouted him down:

Hm. But wouldn't doing so run into a ton of problems anyway? I mean we
need to do a ton of checks and special case hangups to make
CommitTransactionCommand();StartTransactionCommand(); work for VACUUM,
CIC, ...

The cases where one can use AbortTransaction() (via
AbortCurrentTransaction() presumably) are ones where either there's no
surrounding code relying on the transaction (e.g. autovacuum,
postgres.c), or where special care has been taken with portals
(e.g. _SPI_rollback()). We didn't have the pin mechanism back then, so
I think even if we accept your/Tom's reasoning from back then (I don't
really), it's outdated now that the pin mechanism exists.

I'd be happy if we added some defenses against such bogus cases being
introduced (i.e. erroring out if we encounter an active portal during
abort processing).

> Stepping back a bit, stored procedures are a good example of a command
> that uses multiple transactions internally. We have a few others, such
> as VACUUM, but at present, that case only commits transactions
> internally; it does not roll them back internally. If it did, it
> would want the same thing that procedures want, namely, to leave the
> active portal alone. It doesn't quite work to leave the active portal
> completely alone, because the portal has got a pointer to a
> ResourceOwner which is about to be freed by end-of-transaction
> cleanup;

Well, that's why _SPI_commit()/_SPI_rollback() do a HoldPinnedPortals(),
which, via HoldPortal(), sets portal->resowner to = NULL.

> The current code also calls PortalReleaseCachedPlan in this case; I'm
> not 100% certain whether that's appropriate or not.

I think it's appropriate, because we cannot guarantee that the plan is
still usable. Besides normal plan invalidation issues, the catalog
contents the plan might rely on might have only existed in the aborted
transaction - which seems like a fatal problem. That's why holding
portals persists the portalstore. Which then also obsoletes the plan.

> Attached is a patch that (1) removes At(Sub)Cleanup_Portals() entirely
> in favor of dropping portals on the spot in At(Sub)Abort_Portals(),
> (2) overhauls the comments in this area, and (3) makes
> AtSubAbort_Portals() consistent with AtAbort_Portals(). One possible
> concern here - which Andres mentioned to me during off-list discussion
> - is that someone might create a portal for a ROLLBACK statement, and
> then try to execute that portal after an error has occurred.

Besides not quite happening as I though, as you reference below, I think
that only mattered if a rollback gets executed in a failed transaction -
that has to happen after a sync message, because otherwise we'd just
skip the command. But to be problematic, the bind would have to be from
before the sync, and the exec from after - which'd be an extremely
absurd use of the protocol.

> /*
> * Abort processing for portals.
> *
> - * At this point we run the cleanup hook if present, but we can't release the
> - * portal's memory until the cleanup call.
> + * Most portals don't and shouldn't survive transaction abort, but there are
> + * some important special cases where they do and must: (1) held portals must
> + * survive by definition, and (2) any active portal must be part of a command
> + * that uses multiple transactions internally, and needs to survive until
> + * execution of that command has completed.

Hm. Why are active, rather than pinnned, portals relevant here? Normal
multi-transactional commands (e.g. vacuum, CIC) shouldn't get here with
an active portal, as they don't catch errors. And procedures should have
pinned the relevant portals?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2019-09-24 21:08:25 Re: JSONPATH documentation
Previous Message Tom Lane 2019-09-24 20:19:41 Re: allocation limit for encoding conversion