Re: Server crash (FailedAssertion) due to catcache refcount mis-handling

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andreas Seltenreich <seltenreich(at)gmx(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Server crash (FailedAssertion) due to catcache refcount mis-handling
Date: 2017-08-13 18:04:41
Message-ID: 9291.1502647481@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andreas Seltenreich <seltenreich(at)gmx(dot)de> writes:
> Tom Lane writes:
>> I wonder if Andreas would be interested in trying the randomly-timed-
>> SIGTERM thing with sqlsmith.

> So far, most of the core dumps generated are Jeevan's assertion failing
> with backtraces through SearchCatCacheList. The rest is failing this
> assertion:
> TRAP: FailedAssertion("!(portal->cleanup == ((void *)0))", File: "portalmem.c", Line: 846)
> Example backtrace below. They all happened during a rollback statement.
> Testing was done on master at 2336f84284.

Interesting. I can reproduce this by forcing a FATAL exit right there,
eg by adding

if (pstmt->utilityStmt && IsA(pstmt->utilityStmt, TransactionStmt) &&
((TransactionStmt *) pstmt->utilityStmt)->kind == TRANS_STMT_ROLLBACK)
InterruptPending = ProcDiePending = true;

before PortalRunMulti's CHECK_FOR_INTERRUPTS. But it only fails if the
rollback is trying to exit a transaction that's already suffered an error.
The explanation seems to be:

1. Because we already aborted the transaction, xact.c is in state
blockState == TBLOCK_ABORT.

2. This causes AbortOutOfAnyTransaction to think that it can skip
doing AbortTransaction() and go straight to CleanupTransaction().

3. AtCleanup_Portals() is expecting that we already ran, and cleared,
the portal cleanup hook (PortalCleanup) for every live portal. But
we have not done so for the active portal that we were in the midst
of running ROLLBACK in. So it fails the mentioned assertion.

Thus, the basic problem is that we get to CleanupTransaction() without
having done PortalDrop on the portal we're running ROLLBACK in.

We could take this as an argument for simply removing that assertion,
which would mean that when AtCleanup_Portals calls PortalDrop, the cleanup
hook would get run, the same as it would have if exec_simple_query had had
a chance to drop the portal. However, I'm still pretty afraid of allowing
arbitrary code to execute during CleanupTransaction(). What seems like
a better idea is to make AtCleanup_Portals just summarily clear the
cleanup hook, perhaps while emitting a warning.

I tried that (see portalmem.c changes in attached patch), and what I got
was this:

regression=# rollback;
WARNING: skipping cleanup for portal ""
FATAL: terminating connection due to administrator command
FATAL: cannot drop active portal ""
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

or in the server log:

2017-08-13 13:49:50.312 EDT [8737] FATAL: terminating connection due to administrator command
2017-08-13 13:49:50.312 EDT [8737] STATEMENT: rollback;
2017-08-13 13:49:50.312 EDT [8737] WARNING: skipping cleanup for portal ""
2017-08-13 13:49:50.312 EDT [8737] FATAL: cannot drop active portal ""

Well, we could tolerate that maybe, but it doesn't seem like it's actually
acting nicely; the active portal is still creating problems for us.

After some further thought, I decided to try making
AbortOutOfAnyTransaction call AtAbort_Portals() in this situation, thus
basically doing the minimum part of AbortTransaction() needed to clean up
the active portal. That almost worked --- my initial try got an assertion
failure in mcxt.c, because somebody was trying to drop the
CurrentMemoryContext. So the minimum part of AbortTransaction that we
need here is really AtAbort_Memory + AtAbort_Portals. After further
thought I decided it'd be a good idea to phrase that as an unconditional
AtAbort_Memory at the top of AbortOutOfAnyTransaction, thus making sure
we are in some valid context to start with; and then, in case the loop
itself doesn't have anything to do, we need AtCleanup_Memory() at the
bottom of the function to revert CurrentMemoryContext to TopMemoryContext.

In short, the attached patch seems to fix it nicely. We definitely need
the xact.c changes, but the ones in portalmem.c are perhaps optional, as
in theory now the assertion will never be violated. But I'm inclined
to keep the portalmem changes anyway, as that will make it more robust
if the situation ever happens in a non-assert build.

Note: there are a couple of comments elsewhere in portalmem.c that need
adjustment if we keep the portalmem changes. I have not bothered to do
that yet, as it's just cosmetic.

Comments?

regards, tom lane

Attachment Content-Type Size
handle-FATAL-error-during-ROLLBACK.patch text/x-diff 3.7 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2017-08-13 18:33:32 Re: pgbench - allow to store select results into variables
Previous Message Fabien COELHO 2017-08-13 17:57:20 Re: [WIP] Zipfian distribution in pgbench