Re: spoonbill vs. -HEAD

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Postgresql Hackers Mailinglist <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: spoonbill vs. -HEAD
Date: 2013-04-02 22:01:50
Message-ID: 10127.1364940110@sss.pgh.pa.us
Lists: pgsql-hackers

Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc> writes:
> On 03/26/2013 11:30 PM, Tom Lane wrote:
>> A different line of thought is that the cancel was received by the
>> backend but didn't succeed in cancelling the query for some reason.

> I added the "pgcancel failed" codepath you suggested but it does not
> seem to get triggered at all so the above might actually be what is
> happening...

Stefan was kind enough to grant me access to spoonbill, and after
some experimentation I found the problem. It seems that OpenBSD
blocks additional deliveries of a signal while the signal handler is
in progress, and that this is implemented by just calling sigprocmask()
before and after calling the handler. Therefore, if the handler doesn't
return normally --- like, say, it longjmps --- the restoration of the
previous mask never happens. So we're left with the signal still
blocked, meaning second and subsequent attempts to interrupt the backend
don't work.
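
The mechanism can be sketched outside PostgreSQL with plain POSIX calls.
This is only an illustration under stated assumptions, not the backend's
code and not a reproduction of OpenBSD's internals: it uses
sigsetjmp(..., 0) and siglongjmp() so that the jump explicitly does not
restore the signal mask, which makes the effect visible on any POSIX
system. SIGUSR1 and the names jumping_handler, signal_is_blocked and
blocked_after_jump are hypothetical, chosen for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf jump_target;

/* Handler that leaves through a non-local jump instead of returning. */
static void jumping_handler(int signo)
{
    (void) signo;
    siglongjmp(jump_target, 1);
}

/* Returns 1 if signo is in the caller's current signal mask, else 0. */
static int signal_is_blocked(int signo)
{
    sigset_t current;

    sigemptyset(&current);
    /* A NULL "set" argument only queries the mask; nothing is changed. */
    sigprocmask(SIG_BLOCK, NULL, &current);
    return sigismember(&current, signo) == 1;
}

/*
 * Deliver SIGUSR1 once, escape from its handler via siglongjmp, and
 * report whether SIGUSR1 is still blocked afterwards.
 */
static int blocked_after_jump(void)
{
    struct sigaction sa;
    sigset_t usr1;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = jumping_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;    /* no SA_NODEFER: SIGUSR1 is blocked in the handler */
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void) rc;

    /* Start from a known state with SIGUSR1 unblocked. */
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_UNBLOCK, &usr1, NULL);

    /* savemask = 0: the jump back here will not restore the signal mask. */
    if (sigsetjmp(jump_target, 0) == 0)
        raise(SIGUSR1);

    /* Reached via siglongjmp; the handler never returned normally. */
    return signal_is_blocked(SIGUSR1);
}
```

Calling blocked_after_jump() leaves SIGUSR1 blocked in the caller, so a
second raise(SIGUSR1) would stay pending rather than run the handler.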

This isn't revealed by the regular regression tests because they don't
exercise PQcancel, but several recently-added isolation tests do attempt
to PQcancel the same backend more than once.

It's a bit surprising that it's taken us this long to recognize the
problem. Typical use of PQcancel doesn't necessarily cause a failure:
StatementCancelHandler() won't exit through longjmp unless
ImmediateInterruptOK is true, which is only the case while waiting for a
heavyweight lock. But still, you'd think somebody would've run into
the case in normal usage.

I think the simplest fix is to insert "PG_SETMASK(&UnBlockSig)" into
StatementCancelHandler() and any other handlers that might exit via
longjmp. I'm a bit inclined to only do this on platforms where a
problem is demonstrable, which so far is only OpenBSD. (You'd
think that all BSDen would have the same issue, but the buildfarm
shows otherwise.)
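
As a rough sketch of that idea in plain POSIX terms (PG_SETMASK and
UnBlockSig are PostgreSQL's own names and are not used here), a handler
can unblock its own signal before jumping out. The example again assumes
sigsetjmp(..., 0) so the jump never restores the mask; SIGUSR1,
cancel_like_handler, discard_and_unblock and count_deliveries are
hypothetical names for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf jump_target;
static volatile sig_atomic_t deliveries;
static volatile sig_atomic_t unblock_in_handler;

/* Counts the delivery, optionally unblocks its own signal, then jumps. */
static void cancel_like_handler(int signo)
{
    deliveries++;
    if (unblock_in_handler)
    {
        sigset_t self;

        sigemptyset(&self);
        sigaddset(&self, signo);
        sigprocmask(SIG_UNBLOCK, &self, NULL);  /* async-signal-safe */
    }
    siglongjmp(jump_target, 1);
}

/* Ignore SIGUSR1 (which discards a pending instance), then unblock it. */
static void discard_and_unblock(void)
{
    struct sigaction sa;
    sigset_t usr1;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void) rc;

    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_UNBLOCK, &usr1, NULL);
}

/* Raise SIGUSR1 twice; return how many times the handler actually ran. */
static int count_deliveries(int apply_fix)
{
    struct sigaction sa;
    volatile int attempt;
    int rc;

    discard_and_unblock();

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = cancel_like_handler;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void) rc;

    deliveries = 0;
    unblock_in_handler = apply_fix;

    for (attempt = 0; attempt < 2; attempt++)
    {
        /* savemask = 0: jumping back here does not restore the mask. */
        if (sigsetjmp(jump_target, 0) == 0)
            raise(SIGUSR1);
    }

    /* Leave SIGUSR1 ignored so no stale pending signal can jump later. */
    discard_and_unblock();
    return (int) deliveries;
}
```

Without the unblock step the second raise() only leaves the signal
pending, so count_deliveries(0) reports one delivery; with it,
count_deliveries(1) reports both.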

BTW, this does not seem to explain the symptoms shown at
http://www.postgresql.org/message-id/4FE4D89A.8020002@kaltenbrunner.cc
because what we were seeing there was that *all* signals appeared to be
blocked. However, after this round of debugging I no longer have a lot
of faith in OpenBSD's ps, because it was lying to me about whether the
process had signals blocked. At the least, it did not show the interrupt
signal as blocked, whereas debugging code that printed the active signal
mask as reported by sigprocmask() told me the truth. So I'm not sure how
much trust to put in those older ps results. It's possible that the
previous failures were a manifestation of something related to this bug.
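
Debugging code of that general kind might look like the sketch below.
It is not the code that was actually added to the backend; the function
name format_blocked_signals and the short list of signals it checks are
assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>

struct named_signal
{
    int         signo;
    const char *name;
};

/* A small, arbitrary selection of signals worth reporting on. */
static const struct named_signal watched[] = {
    {SIGINT, "SIGINT"},
    {SIGQUIT, "SIGQUIT"},
    {SIGTERM, "SIGTERM"},
    {SIGUSR1, "SIGUSR1"},
    {SIGUSR2, "SIGUSR2"},
};

/*
 * Write the names of the watched signals that are currently blocked,
 * separated by spaces, into out. Returns the number of names written,
 * or -1 on error or if the buffer is too small.
 */
static int format_blocked_signals(char *out, size_t outlen)
{
    sigset_t current;
    size_t   used = 0;
    int      count = 0;
    size_t   i;

    assert(out != NULL);
    if (outlen == 0)
        return -1;
    out[0] = '\0';

    /* A NULL "set" argument queries the mask without changing it. */
    sigemptyset(&current);
    if (sigprocmask(SIG_BLOCK, NULL, &current) != 0)
        return -1;

    for (i = 0; i < sizeof(watched) / sizeof(watched[0]); i++)
    {
        int n;

        if (sigismember(&current, watched[i].signo) != 1)
            continue;
        n = snprintf(out + used, outlen - used, "%s%s",
                     count > 0 ? " " : "", watched[i].name);
        if (n < 0 || (size_t) n >= outlen - used)
            return -1;
        used += (size_t) n;
        count++;
    }
    return count;
}
```

The resulting string can be sent to whatever log the process already
uses, and compared against what ps reports for the same process.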

regards, tom lane
