ERROR during end-of-xact/FATAL

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: ERROR during end-of-xact/FATAL
Date: 2013-10-31 14:52:34
Message-ID: 20131031145234.GA621493@tornado.leadboat.com
Lists: pgsql-hackers

CommitTransaction() and AbortTransaction() both do much work, and large
portions of that work either should not or must not throw errors. An error
during either function will, as usual, siglongjmp() out. Ordinarily,
PostgresMain() will regain control and fire off a fresh AbortTransaction().
The consequences thereof depend on the original function's progress:

- Before the function updates CurrentTransactionState->state, an ERROR is
fully acceptable. CommitTransaction() specifically places failure-prone
tasks accordingly; AbortTransaction() has no analogous tasks.

- After the function updates CurrentTransactionState->state, an ERROR yields a
user-unfriendly message such as "WARNING: AbortTransaction while in COMMIT
state". This is not itself harmful, but we've largely kept the things that
can fail for pedestrian reasons ahead of that point.

- After CommitTransaction() calls RecordTransactionCommit() for an xid-bearing
transaction, an ERROR is promoted to a PANIC, e.g. "PANIC: cannot abort
transaction 805, it was already committed".

- After AbortTransaction() calls ProcArrayEndTransaction() for an xid-bearing
transaction, an ERROR will lead to this assertion failure:

TRAP: FailedAssertion("!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0)))", File: "procarray.c", Line: 396)
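
As a compact restatement of the four cases above, here is a small decision
function in C. It is only a sketch: every type and function name in it is
hypothetical, not a PostgreSQL identifier. It also assumes that a transaction
without an xid past the late step falls back to the WARNING case, which the
list above does not state outright, and the assertion outcome applies to
assert-enabled builds.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; none of these exist in PostgreSQL. */
typedef enum { TOY_COMMIT, TOY_ABORT } ToyXactFunc;

typedef enum
{
    BEFORE_STATE_UPDATE,    /* CurrentTransactionState->state not yet changed */
    AFTER_STATE_UPDATE,     /* state changed, late step not yet reached */
    AFTER_LATE_STEP         /* RecordTransactionCommit() during commit, or
                             * ProcArrayEndTransaction() during abort */
} ToyProgress;

typedef enum
{
    OUTCOME_ORDINARY_ABORT, /* ERROR is fully acceptable */
    OUTCOME_WARNING,        /* "AbortTransaction while in COMMIT state" */
    OUTCOME_PANIC,          /* "cannot abort transaction ..., already committed" */
    OUTCOME_ASSERT_FAILURE  /* procarray.c assertion in an asserts build */
} ToyOutcome;

/* Map "where the ERROR happened" to the consequence described in the list. */
static ToyOutcome
late_error_outcome(ToyXactFunc func, ToyProgress progress, bool has_xid)
{
    if (progress == BEFORE_STATE_UPDATE)
        return OUTCOME_ORDINARY_ABORT;

    if (progress == AFTER_LATE_STEP && has_xid)
        return (func == TOY_COMMIT) ? OUTCOME_PANIC : OUTCOME_ASSERT_FAILURE;

    /* Assumption: without an xid, the late step adds nothing beyond this. */
    return OUTCOME_WARNING;
}
```

For example, late_error_outcome(TOY_COMMIT, AFTER_LATE_STEP, true) yields
OUTCOME_PANIC, matching the third case.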

If the original AbortTransaction() pertained to a FATAL, the situation is
worse. errfinish() promotes the ERROR thrown from AbortTransaction() to
another FATAL, so we reenter proc_exit(). Thanks to the following logic in
shmem_exit(), we will never return to AbortTransaction():

	/*
	 * call all the registered callbacks.
	 *
	 * Note that since we decrement on_proc_exit_index each time, if a
	 * callback calls ereport(ERROR) or ereport(FATAL) then it won't be
	 * invoked again when control comes back here (nor will the
	 * previously-completed callbacks). So, an infinite loop should not be
	 * possible.
	 */

As a result, we miss any cleanups that had not yet happened in the original
AbortTransaction(). In particular, this can leak heavyweight locks. An
asserts build subsequently fails this way:

TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File: "proc.c", Line: 788)

In a production build, the affected PGPROC slot just continues to hold the
lock until the next backend claiming that slot calls LockReleaseAll(). Oops.
Bottom line: most bets are off given an ERROR after RecordTransactionCommit()
in CommitTransaction() or anywhere in AbortTransaction().
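
To illustrate why the re-entered exit path never returns to the failed
AbortTransaction(), here is a toy model in C of a decrement-before-call
callback loop. It is a sketch under assumptions, not PostgreSQL code: all
names (register_callback(), toy_abort_callback(), simulate_exit(), and so on)
are hypothetical, longjmp() merely stands in for an ERROR promoted to FATAL,
and a boolean stands in for the heavyweight-lock release that the abort would
have reached.

```c
#include <setjmp.h>
#include <stdbool.h>

#define MAX_CALLBACKS 8

typedef void (*exit_callback) (void);

/* Toy stand-ins; none of these names exist in PostgreSQL. */
static exit_callback callbacks[MAX_CALLBACKS];
static int  callback_index = 0;         /* number of callbacks still pending */
static jmp_buf exit_reentry;            /* models FATAL re-entering the exit path */

static bool inject_error = false;       /* make the abort callback fail midway? */
static int  abort_calls = 0;            /* times the abort callback was entered */
static bool locks_released = false;     /* late cleanup step inside the abort */
static bool final_callback_ran = false; /* a later callback still runs */

static void
register_callback(exit_callback cb)
{
    if (callback_index < MAX_CALLBACKS)
        callbacks[callback_index++] = cb;
}

/* Models an abort that may throw before it reaches lock release. */
static void
toy_abort_callback(void)
{
    abort_calls++;
    if (inject_error)
        longjmp(exit_reentry, 1);       /* "ERROR", promoted to "FATAL" */
    locks_released = true;              /* skipped when the error fires */
}

static void
toy_final_callback(void)
{
    final_callback_ran = true;
}

static void
run_exit_callbacks(void)
{
    /* A longjmp() lands here; the loop resumes below the failed entry. */
    (void) setjmp(exit_reentry);

    /* Decrement first, then call: a failing callback is never retried. */
    while (--callback_index >= 0)
        callbacks[callback_index] ();
    callback_index = 0;
}

/* Runs one simulated backend exit; returns whether locks were released. */
static bool
simulate_exit(bool error_in_abort)
{
    callback_index = 0;
    abort_calls = 0;
    locks_released = false;
    final_callback_ran = false;
    inject_error = error_in_abort;

    register_callback(toy_final_callback);  /* registered first, runs last */
    register_callback(toy_abort_callback);
    run_exit_callbacks();
    return locks_released;
}
```

In this model, simulate_exit(true) returns false with abort_calls equal to 1:
the abort callback is entered once, fails, and is skipped when the loop
resumes, so its remaining cleanup is lost while the later callback still runs.
simulate_exit(false) returns true.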

Now, while those assertion failures are worth preventing on general principle,
their importance in the field depends on whether the vulnerable end-of-xact
work actually fails. We've prevented the errors that would otherwise be
relatively common, but there are several rare ways to get a late ERROR.
Incomplete list:

- If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
relpathbackend() call palloc(); this is true in all supported branches. In
9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
(In fact, it does so even when the pending list is empty -- this is the only
palloc() during a trivial transaction commit.) palloc() failure there
yields a PANIC during commit.

- ResourceOwnerRelease() calls FileClose() during abort, and FileClose()
raises an ERROR when close() returns EIO.

- AtEOXact_Inval() can lead to calls like RelationReloadIndexInfo(), which has
many ways to throw errors. This precedes releasing heavyweight locks, so an
error here during an abort pertaining to a FATAL exit orphans locks as
described above. This relates to another recent thread:
http://www.postgresql.org/message-id/20130805170931.GA369289@tornado.leadboat.com

What should we do to mitigate these problems? Certainly we can harden
individual end-of-xact tasks to not throw errors, as we have in the past.
What higher-level strategies should we consider? What about for the unclean
result of the FATAL-then-ERROR scenario in particular? If we can't manage to
free a shared memory resource like a lock or buffer pin, we really must PANIC.
Releasing those things is quite reliable, though. The tasks that have the
highest chance of capsizing the AbortTransaction() are of backend-local
interest, or they're tasks for which we tolerate failure as a rule
(e.g. unlinking files).

Robert Haas provided a large slice of the research for this report.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
