elog(FATAL) vs shared memory

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Martin Pitt <martin(dot)pitt(at)ubuntu(dot)com>, Mark Shuttleworth <mark(at)ubuntu(dot)com>
Subject: elog(FATAL) vs shared memory
Date: 2007-04-04 18:04:50
Message-ID: 23609.1175709890@sss.pgh.pa.us
Lists: pgsql-hackers

In this thread:
http://archives.postgresql.org/pgsql-bugs/2007-03/msg00145.php
we eventually determined that the reported lockup had three components:

(1) something (still not sure what --- Martin and Mark, I'd really like
to know) was issuing random SIGTERMs to various postgres processes
including autovacuum.

(2) if a SIGTERM happens to arrive while btbulkdelete is running,
the next CHECK_FOR_INTERRUPTS will do elog(FATAL), causing elog.c
to do proc_exit(0), leaving the vacuum still recorded as active in
the shared memory array maintained by _bt_start_vacuum/_bt_end_vacuum.
The PG_TRY block in btbulkdelete doesn't get a chance to clean up
(a standalone sketch of this control flow follows point (3)).

(3) eventually, either we try to re-vacuum the same index or
accumulation of bogus active entries overflows the array.
Either way, _bt_start_vacuum throws an error, which btbulkdelete
PG_CATCHes, leading to _bt_end_vacuum trying to re-acquire the LWLock
already taken by _bt_start_vacuum, meaning that the process hangs up.
And then so does anything else that needs to take that LWLock...
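
To make the mechanics concrete, here is a standalone sketch (ordinary
C, not backend code) of the control flow: setjmp/longjmp stands in for
PG_TRY/PG_CATCH, and exit() stands in for elog(FATAL)'s proc_exit(0).
All names here are made up for illustration.

    #include <setjmp.h>
    #include <stdio.h>
    #include <stdlib.h>

    static jmp_buf exception_stack;     /* stand-in for PG_TRY's setjmp target */
    static int vacuum_recorded = 0;     /* stand-in for the shmem array entry */

    /* like elog(ERROR): longjmp to the nearest PG_CATCH */
    static void fake_elog_error(void) { longjmp(exception_stack, 1); }

    /* like elog(FATAL): elog.c calls proc_exit(0), no longjmp */
    static void fake_elog_fatal(void) { exit(0); }

    int
    main(void)
    {
        vacuum_recorded = 1;            /* _bt_start_vacuum */
        if (setjmp(exception_stack) == 0)
        {
            /* index scan; CHECK_FOR_INTERRUPTS() happens in here */
            fake_elog_fatal();          /* SIGTERM noticed: exits right here */
        }
        else
        {
            /* PG_CATCH: reached via fake_elog_error(), never via _fatal() */
            vacuum_recorded = 0;        /* _bt_end_vacuum */
        }
        printf("vacuum_recorded = %d\n", vacuum_recorded);
        return 0;
    }

Swap fake_elog_fatal() for fake_elog_error() and the cleanup runs; as
written, the process exits with vacuum_recorded still set, which is
exactly the stale shmem entry of point (2).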

Point (3) is already fixed in CVS, but point (2) is a lot nastier.
What it essentially says is that trying to clean up shared-memory
state in a PG_TRY block is unsafe: you can't be certain you'll
get to do it. Now this is not a big deal during normal SIGTERM or
SIGQUIT database shutdown, because we're going to abandon the shared
memory segment anyway. However, if we ever want to support individual
session kill via SIGTERM, it's a problem. Even if we were not
interested in someday considering that a supported feature, it seems
that dealing with random SIGTERMs is needed for robustness in at least
some environments.

AFAICS, there are basically two ways we might try to approach this:

Plan A: establish the rule that you mustn't try to clean up shared
memory state in a PG_CATCH block. Anything you need to do like that
has to be handled by an on_shmem_exit hook function, so it will be
called during a FATAL exit. (Or maybe you can do it in PG_CATCH for
normal ERROR cases, but you need a backing on_shmem_exit hook to
clean up for FATAL.)
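
For the btree case, a minimal Plan A sketch might look like this.
on_shmem_exit(), _bt_start_vacuum(), and _bt_end_vacuum() are the
existing APIs; the callback and wrapper names are hypothetical. It
leans on the fact that _bt_end_vacuum() quietly ignores a missing
entry, so the hook is harmless after a normal exit; a real version
would also have to think about what lock state the callback can
assume at proc_exit() time.

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "storage/ipc.h"

    /* hypothetical backstop: proc_exit() runs this even on elog(FATAL) */
    static void
    btvacuum_exit_callback(int code, Datum arg)
    {
        _bt_end_vacuum((Relation) DatumGetPointer(arg));
    }

    /* hypothetical wrapper around the existing entry point */
    static BTCycleId
    _bt_start_vacuum_guarded(Relation rel)
    {
        /* arm the hook before recording ourselves in shared memory */
        on_shmem_exit(btvacuum_exit_callback, PointerGetDatum(rel));
        return _bt_start_vacuum(rel);
    }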

Plan B: change the handling of FATAL errors so that they are thrown
like normal errors, and the proc_exit call happens only when we get
out to the outermost control level in postgres.c. This would mean
that PG_CATCH blocks get a chance to clean up before the FATAL exit
happens. The problem with that is that a non-cooperative PG_CATCH
block might think it could "recover" from the error, and then the exit
does not happen at all. We'd need a coding rule that PG_CATCH blocks
*must* re-throw FATAL errors, which seems at least as ugly as Plan A.
In particular, all three of the external-interpreter PLs are willing
to return errors into the external interpreter, and AFAICS we'd be
entirely at the mercy of the user-written Perl or Python or Tcl code
whether it re-throws the error or not.
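
Concretely, the rule would force every recovering PG_CATCH into a
pattern like the one below. CopyErrorData(), FlushErrorState(), and
PG_RE_THROW() are the existing error-handling APIs; guarded_operation()
and risky_operation() are placeholders.

    #include "postgres.h"

    extern void risky_operation(void);  /* placeholder for protected code */

    static void
    guarded_operation(void)
    {
        MemoryContext oldcontext = CurrentMemoryContext;

        PG_TRY();
        {
            risky_operation();
        }
        PG_CATCH();
        {
            ErrorData  *edata;

            /* must get out of ErrorContext before copying the error */
            MemoryContextSwitchTo(oldcontext);
            edata = CopyErrorData();

            if (edata->elevel >= FATAL)
                PG_RE_THROW();          /* the rule: never swallow FATAL */

            FlushErrorState();
            FreeErrorData(edata);
            /* ... local recovery for plain ERRORs ... */
        }
        PG_END_TRY();
    }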

So Plan B seems unacceptably fragile. Does anyone see a way to fix it,
or perhaps a Plan C with a totally different idea? Plan A seems pretty
ugly but it's the best I can come up with.

regards, tom lane
