Re: stats collector dies in current

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: stats collector dies in current
Date: 2004-08-15 04:19:08
Message-ID: 19363.1092543548@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> In that context, is SIGTSTP similar to SIGSTOP in that it cannot be
> caught or ignored?

Possibly. I've reproduced the problem here on an RHL 8 system
(2.4.18 kernel) and I think it's a kernel bug. Points:

1. AFAICS, the only case where the stats buffer process will exit(1)
without logging a prior message is where it's gotten SIGCHLD. So,
hypothesis: it is the collector process (grandchild process) that
is dying.

2. Experiment one: try to strace the collector process to see what
it's doing. Result: failure goes away!!!

3. Experiment two: try to strace the buffer process. Result: indeed
it's getting SIGCHLD (in fact it seems to get it before SIGTSTP
arrives).

So at the very least we've got a Heisenbug, but my opinion is we are
seeing broken kernel behavior.
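
For readers who want to see the mechanism in point 1 in isolation, here is a
minimal sketch. It is not the pgstat code: every name in it is hypothetical,
and it only assumes what is described above, namely a "buffer" process that
reacts to SIGCHLD by exiting with status 1 without logging anything, and a
"collector" child that dies.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the buffer process's SIGCHLD reaction:
 * leave with status 1 and print nothing. */
static void buffer_on_sigchld(int signo)
{
    (void) signo;
    _exit(1);                   /* async-signal-safe, unlike exit() */
}

/* Body of the hypothetical "buffer" process; never returns. */
static void run_buffer_process(void)
{
    struct sigaction act;
    pid_t collector;

    memset(&act, 0, sizeof(act));
    act.sa_handler = buffer_on_sigchld;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGCHLD, &act, NULL) < 0)
        _exit(2);

    collector = fork();
    if (collector < 0)
        _exit(2);
    if (collector == 0)
        _exit(0);               /* the "collector" dies immediately */

    for (;;)
        pause();                /* SIGCHLD ends this process via the handler */
}

/* Forks the buffer process and returns its exit status, or -1 on error. */
int buffer_status_after_collector_death(void)
{
    int status;
    pid_t buffer = fork();

    if (buffer < 0)
        return -1;
    if (buffer == 0)
        run_buffer_process();

    while (waitpid(buffer, &status, 0) < 0)
    {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

From the outside, all that is visible is a process that exited with status 1
and left no message, which is why the SIGCHLD path is the suspect here.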

The only difference in signal handling that I can see from 7.4 is that
the collector process explicitly executes pqsignal calls to re-establish
all the signal handlers it should have inherited from its parent.
I suspect (but haven't tested) that removing that supposedly redundant
code would make the failure go away again.
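
To make "re-establish the handlers" concrete, here is a rough sketch of a
pqsignal-style wrapper over sigaction() and a routine that re-installs a set
of handlers. The names demo_pqsignal, reestablish_handlers and on_usr1 are
made up for illustration, and the choice of SIGUSR1 and SIGPIPE is arbitrary;
this is not the actual collector code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

typedef void (*demo_sigfunc)(int);

static volatile sig_atomic_t got_usr1 = 0;

/* Example handler: just records that the signal arrived. */
static void on_usr1(int signo)
{
    (void) signo;
    got_usr1 = 1;
}

/* Hypothetical pqsignal-like wrapper: installs func for signo and
 * returns the previous handler, or SIG_ERR on failure. */
static demo_sigfunc demo_pqsignal(int signo, demo_sigfunc func)
{
    struct sigaction act, oact;

    memset(&act, 0, sizeof(act));
    act.sa_handler = func;
    sigemptyset(&act.sa_mask);
    act.sa_flags = SA_RESTART;  /* restart interrupted system calls */
    if (sigaction(signo, &act, &oact) < 0)
        return SIG_ERR;
    return oact.sa_handler;
}

/* Re-install the handlers a child would normally inherit across fork().
 * Returns 0 on success, -1 if any installation failed. */
static int reestablish_handlers(void)
{
    if (demo_pqsignal(SIGUSR1, on_usr1) == SIG_ERR)
        return -1;
    if (demo_pqsignal(SIGPIPE, SIG_IGN) == SIG_ERR)
        return -1;
    return 0;
}
```

After a plain fork() these calls should change nothing, since the child
already has the same dispositions; that is the sense in which the code is
"supposedly redundant".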

The handler re-establishment was put in because it is needed for the
EXEC_BACKEND case, but possibly we could make it #ifndef EXEC_BACKEND
to work around this problem.
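
A sketch of what such a guard might look like, reading the suggestion as
"skip the re-establishment in builds without EXEC_BACKEND". The function and
handler names are hypothetical and SIGUSR2 is an arbitrary choice; only the
EXEC_BACKEND symbol comes from the discussion above.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

#ifdef EXEC_BACKEND
/* Hypothetical handler, only needed when handlers are re-installed. */
static void demo_noop_handler(int signo)
{
    (void) signo;
}
#endif

/* Returns the number of handlers re-installed, or -1 on failure. */
int collector_reinstall_handlers(void)
{
#ifndef EXEC_BACKEND
    /* fork()-only build: dispositions are inherited from the parent,
     * so leave them alone. */
    return 0;
#else
    /* fork()+exec() build: handlers do not survive exec(), so they
     * have to be installed again. */
    struct sigaction act;

    memset(&act, 0, sizeof(act));
    act.sa_handler = demo_noop_handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = SA_RESTART;
    return (sigaction(SIGUSR2, &act, NULL) == 0) ? 1 : -1;
#endif
}
```

Built without -DEXEC_BACKEND the function does nothing and returns 0, so the
forked collector keeps exactly the dispositions it inherited.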

regards, tom lane
