Re: BUG #5269: postgres backend terminates with SIGSEGV

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Justin Pitts <justinpitts(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-bugs(at)postgreSQL(dot)org
Subject: Re: BUG #5269: postgres backend terminates with SIGSEGV
Date: 2010-01-13 04:53:48
Message-ID: 4010.1263358428@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Justin Pitts <justinpitts(at)gmail(dot)com> writes:
> It went seriously heisenbug since I added %p. It took all day to crash it since I added that to the configuration. Except this latest crash is completely different - signal 6 ABRT, different place.

Specifically:

2010-01-13 01:50:42 UTC 5455 ERROR: snapshot reference 0xba6ac0 is not owned by resource owner TopTransaction
2010-01-13 01:50:42 UTC 5455 STATEMENT: COMMIT
2010-01-13 01:50:42 UTC 5455 WARNING: AbortTransaction while in COMMIT state
2010-01-13 01:50:42 UTC 5455 PANIC: cannot abort transaction 777554, it was already committed

This looks different, but it matches what I had concluded from the
previous dumps: snapmgr.c thinks it has a valid serializable snapshot,
but it doesn't. In your previous crashes, CurrentSnaphot was pointing
at trashed memory; here, it's still a valid-looking snapshot, but it's
not registered in the TopTransactionResourceOwner like it's supposed to
be (it wasn't in the other cases either).

After puzzling over this for many hours, I have a theory that seems to
fit the facts. What I think must be happening is that
GetTransactionSnapshot is getting called outside any transaction ---
that is, after AtEOXact_Snapshot but before CleanupTransaction (thus,
the old TopTransactionResourceOwner is still there, so its attempt
to register the snap doesn't dump core). Your build isn't
assert-enabled so the extra snapshot registration wouldn't be noticed
by the eventual ResourceOwnerDelete. Then, when we next start a
transaction and request a serializable snapshot, GetTransactionSnapshot
thinks it has nothing to do and happily hands back a pointer to garbage.

Now, how the heck did GetTransactionSnapshot get called that way?
Your log shows that you're using client software that likes to abort
failed transactions with a previously prepared statement:

2010-01-13 01:50:41 UTC 5455 LOG: execute S_2: ROLLBACK

This means that to get to CleanupTransaction, we have to look up a
cached plan while in aborted transaction state. And that will call
RevalidateCachedPlan, which does this if the plan is marked dead:

/*
* If a snapshot is already set (the normal case), we can just use
* that for parsing/planning. But if it isn't, install one. Note: no
* point in checking whether parse analysis requires a snapshot;
* utility commands don't have invalidatable plans, so we'd not get
* here for such a command.
*/
if (!ActiveSnapshotSet())
{
PushActiveSnapshot(GetTransactionSnapshot());
snapshot_set = true;
}

As the comment indicates, it should not be possible to get here for a
ROLLBACK plan. But look what got added in 8.4: ResetPlanCache(), which
marks plans dead *unconditionally*. And one of the ways that that can
get invoked is via a SI queue overrun event, which triggers a general
cache flush and in particular calls PlanCacheSysCallback().

So what is needed to make this scenario happen is a SI queue overrun,
which isn't a common event unless your application does a lot
of catalog manipulations. I suppose that the reason your reporting
query manages to tickle it is that it creates/deletes temp tables.
And even then, you'd need an actual failure on the side receiving
the overrun notification. In the cases at hand that failure is

2010-01-13 01:50:41 UTC 5455 ERROR: could not serialize access due to concurrent update

which doesn't happen real often in your app, but does happen.

So this theory explains what I found in the core dumps, which is
otherwise dang hard to explain (I spent a lot of time looking for
code paths that could bypass AtEOXact_Snapshot, but there aren't
any). It explains why the failure is uncommon. It explains why the
failures come in bunches (since a SI overrun is broadcast, and
would thereby poison all active sessions with "dead" ROLLBACK
plans; and the subsequent serialization errors could easily come
in bunches too). It explains why we haven't seen similar stuff
in testing --- preparing something as trivial as a ROLLBACK is
pretty silly so nobody does it. And it explains why we didn't
see it before 8.4 --- the bug was in fact introduced here:
http://archives.postgresql.org/pgsql-committers/2008-09/msg00084.php
So even though I haven't actually reproduced the problem, I have a
fair amount of confidence in the explanation.

We could perhaps fix this by changing the above-quoted code, but
I'm inclined to teach ResetPlanCache that utility statements should
never be marked dead. They don't have plans in the normal sense
of the word so there's no point in forcing a revalidation cycle.

I'm kinda beat right now but will work on a patch tomorrow.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Kris Jurka 2010-01-13 07:07:21 Re: BUG #5269: postgres backend terminates with SIGSEGV
Previous Message Robert Haas 2010-01-13 01:48:36 Re: BUG #5273: Unexpected function behavior/failure