Re: windows doesn't notice backend death

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: windows doesn't notice backend death
Date: 2009-05-03 19:04:27
Message-ID: 29196.1241377467@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>> Well, I can tell you that it is getting an exit code of 1, which is why
>> the postmaster isn't restarting.

> Blech. Count on Windows to find a way to break things.

I reflected on this a bit more. Even if we find a way around this
particular task-manager behavior, it seems to me there is a generic
problem here. If some bit of clueless code does exit(0) or exit(1)
inside a backend session, the postmaster will think everything is fine,
but actually we have an un-cleaned-up session that's probably still
holding locks etc. It's fairly easy to demonstrate the issue:

pl_regression=# create language plperlu;
CREATE LANGUAGE
pl_regression=# create or replace function trouble() returns void as
pl_regression-# $$ exit 0; $$ language plperlu;
CREATE FUNCTION
pl_regression=# select trouble();
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
pl_regression=# select * from pg_stat_activity;
 datid |    datname    | procpid | usesysid | usename |          current_query          | waiting |          xact_start           |          query_start          |         backend_start         | client_addr | client_port
-------+---------------+---------+----------+---------+---------------------------------+---------+-------------------------------+-------------------------------+-------------------------------+-------------+-------------
 40179 | pl_regression |   20847 |       10 | tgl     | select trouble();               | f       | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:46:10.170604-04 | 2009-05-03 14:45:10.911359-04 |             |          -1
 40179 | pl_regression |   20855 |       10 | tgl     | select * from pg_stat_activity; | f       | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:23.986909-04 | 2009-05-03 14:46:17.920486-04 |             |          -1
(2 rows)

Up to now we've always just dismissed the above possibility as
"superusers should know better", but I think there's a reasonable case
to be made that this is an obvious failure mode and we should put a bit
more effort into being robust against it. With more and more external
code being routinely run in the backend, who wants to swear that there
is no "exit(1)" in the guts of libperl or libxml or whatever?

The first idea that comes to mind is to have some sort of "dead man
switch" that flags an active backend and is reset by proc_exit() after
it's finished cleaning up everything else. If the postmaster sees
this flag still set after backend exit, then it treats the backend as
having crashed regardless of what the reported exit code is.
We could implement this via an array of sig_atomic_t in shared memory,
so as to minimize the postmaster's entanglement with shared memory
(it'd be no worse than the old WIN32-specific child pid arrays).
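
To make the idea concrete, here is a rough standalone sketch, not
postmaster code. It assumes a POSIX system with fork() and an anonymous
shared mmap() standing in for the real shared memory segment. All names
(MAX_BACKENDS, dead_man_flags, dead_man_init, clean_exit, child_crashed,
rogue_body, polite_body) are made up for illustration, and clean_exit()
is only a stand-in for proc_exit(). The rule that exit codes 0 and 1
count as "normal" mirrors the behavior described above.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on some libcs */
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_BACKENDS 64         /* hypothetical fixed number of slots */

/* One flag per slot; nonzero means "started but not yet cleaned up". */
static volatile sig_atomic_t *dead_man_flags;

/* Map the flag array; must be called once, before any child is forked. */
bool dead_man_init(void)
{
    void *p = mmap(NULL, MAX_BACKENDS * sizeof(sig_atomic_t),
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return false;
    dead_man_flags = p;         /* anonymous mappings start zero-filled */
    return true;
}

/* Stand-in for proc_exit(): the flag is cleared last, after all cleanup. */
void clean_exit(int slot, int code)
{
    /* ... release locks, close files, and so on ... */
    dead_man_flags[slot] = 0;
    _exit(code);
}

/* Misbehaving session code: leaves without going through clean_exit(). */
void rogue_body(int slot)
{
    (void) slot;
    _exit(0);                   /* plays the role of the plperlu "exit 0" */
}

/* Well-behaved session code: just returns to its caller. */
void polite_body(int slot)
{
    (void) slot;
}

/*
 * Run body in a forked child and report whether the parent should treat
 * the child as crashed. The exit status alone is not trusted: a flag that
 * is still set after the child is gone means cleanup never ran.
 */
bool child_crashed(int slot, void (*body)(int))
{
    pid_t pid;
    int   status;
    bool  flag_still_set;

    assert(slot >= 0 && slot < MAX_BACKENDS);
    dead_man_flags[slot] = 1;   /* armed by the parent, before fork() */

    pid = fork();
    if (pid < 0)
    {
        dead_man_flags[slot] = 0;
        return true;            /* fork failed; reported as a failure here */
    }
    if (pid == 0)
    {
        body(slot);
        clean_exit(slot, 0);
    }

    if (waitpid(pid, &status, 0) < 0)
        return true;

    flag_still_set = (dead_man_flags[slot] != 0);
    dead_man_flags[slot] = 0;   /* recycle the slot */

    if (!WIFEXITED(status) || WEXITSTATUS(status) > 1)
        return true;            /* signal or unexpected exit code */
    return flag_still_set;
}
```

After a successful dead_man_init(), child_crashed(0, rogue_body) reports
a crash even though the child's exit code is 0, while
child_crashed(1, polite_body) does not. The parent only ever reads or
writes a single sig_atomic_t per child, which is about as little
shared-memory entanglement as one can get away with.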

Or maybe there's a better way. Thoughts?

regards, tom lane
