Re: Bug in walsender when calling out to do_pg_stop_backup (and others?)

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in walsender when calling out to do_pg_stop_backup (and others?)
Date: 2011-10-06 12:34:54
Message-ID: 4F675930-733E-4038-8E6F-25ED36D64990@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Oct5, 2011, at 15:30 , Magnus Hagander wrote:
> When walsender calls out to do_pg_stop_backup() (during base backups),
> it is not possible to terminate the process with a SIGTERM - it
> requires a SIGKILL. This can leave unkillable backends for example if
> archive_mode is on and archive_command is failing (or not set). A
> similar thing would happen in other cases if walsender calls out to
> something that would block (do_pg_start_backup() for example), but the
> stop one is easy to provoke.

Hm, this seems to be related to another buglet I noticed a while ago,
but then forgot about again. If one terminates pg_basebackup while it's
waiting for all required WAL to be archived, the backend process only
exits once that waiting phase is over. If, like in your failure case,
archive_command fails indefinity (or isn't set), the backend process
stays around forever.

Your patch would improve that only insofar as it'd at least allow an
immediate shutdown request to succeed - as it stands, that doesn't work
because, as you mentioned, the blocked walsender doesn't handle SIGTERM.

The question is, should we do more? To me, it'd make sense to terminate
a backend once it's connection is gone. We could, for example, make
pq_flush() set a global flag, and make CHECK_FOR_INTERRUPTS handle a
broken connection that same way as a SIGINT or SIGTERM.

Thoughts?

> ISTM one way to fix it is the attached, which is to have walsender set
> the "global" flags saying that we have received sigterm, which in turn
> causes the CHECK_FOR_INTERRUPTS() calls in the routines to properly
> exit the process. AFAICT it works fine. Any holes in this approach?

Seems sensible, but my knowledge about these code paths is quite limited..

> Second, I wonder if we should add a SIGINT handler as well, that would
> make it possible to send a cancel signal. Given that the end result
> would be the same (at least if we want to keep with the "walsender is
> simple" path), I'm not sure it's necessary - but it would at least
> help those doing pg_cancel_backend()... thoughts?

If all that's needed is a simple SIGINT handler that sets one flag, it'd
say let's add it.

best regards,
Florian Pflug

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-10-06 13:24:42 Re: alter table only ... drop constraint broken in HEAD
Previous Message Robert Haas 2011-10-06 12:11:52 Re: Inserting heap tuples in bulk in COPY