Re: Stefan's bug (was: max_standby_delay considered harmful)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Florian Pflug <fgp(at)phlo(dot)org>, Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Stefan's bug (was: max_standby_delay considered harmful)
Date: 2010-05-17 01:25:41
Message-ID: AANLkTikueVPERe7CzQdYp2nWkDUdH7xFqooZEdZ_EUJa@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 12, 2010 at 10:41 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, May 12, 2010 at 10:36 PM, Alvaro Herrera
> <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>> Excerpts from Robert Haas's message of mié may 12 20:48:41 -0400 2010:
>>> On Wed, May 12, 2010 at 3:55 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> > I am wondering if we are not correctly handling the case where we get
>>> > a shutdown request while we are still in the PM_STARTUP state.  It
>>> > looks like we might go ahead and switch to PM_RECOVERY and then
>>> > PM_RECOVERY_CONSISTENT without noticing the shutdown.  There is some
>>> > logic to handle the shutdown when the startup process exits, but if
>>> > the startup process never exits it looks like we might get stuck.
>>>
>>> I can reproduce the behavior Stefan is seeing consistently using the
>>> attached patch.
>>>
>>> W1: postgres -D ~/pgslave
>>> W2: pg_ctl -D ~/pgslave stop
>>
>> If there's anything to learn from this patch, is that sleep is
>> uninterruptible on some platforms.  This is why sleeps elsewhere are
>> broken down in loops, sleeping in small increments and checking
>> interrupts each time.  Maybe some of the sleeps in the new HS code need
>> to be handled this way?
>
> I don't think the problem is that the sleep is uninterruptible.  I
> think the problem is that a smart shutdown request received while in
> the PM_STARTUP state does not acted upon until we enter the PM_RUN
> state.  That is, there's a race condition between the SIGUSR1 that the
> startup process sends to the postmaster to signal that recovery has
> begun and the SIGTERM being sent by pg_ctl.
>
> However, since I haven't succeeded in producing a fix yet, take that
> with a grain of salt...

I'm replying to this thread rather than the max_standby_delay thread
because this line of discussion is not actually related to
max_standby_delay.

Fujii Masao previously reported this problem and proposed this fix:
http://archives.postgresql.org/pgsql-hackers/2010-04/msg00592.php

I responded by proposing that we instead simply adjust things so that
a smart shutdown is always handled at the end of recovery, just prior
to entering normal running. On further reflection, though, I've
concluded that was a dumb idea. Currently, when a smart shutdown
occurs during recovery, we handle it immediately UNLESS it happens
before we receive the signal that tells us it's OK to start the
background writer, in which case we get confused and lock everyone out
of the database until a fast or immediate shutdown request arrives.
So this is just a race condition, plain and simple. Therefore I think
Fujii Masao's original idea was the best, but I have what I believe is
an equivalent but simpler implementation, which is attached.

Thoughts? Should we try to fix this in 8.4 also, or just in HEAD?
8.3 and 8.2 never handle a smart shutdown prior to entering normal
running, and while that seems pretty useless, doing something
different would be a behavior change, so that seems like a
non-starter. 8.4 has the same behavior as HEAD, though it's not
documented in the release notes, so it's not clear how intentional the
change was.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment Content-Type Size
fix_smart_shutdown_in_recovery.patch application/octet-stream 546 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2010-05-17 01:27:48 Re: pg_upgrade and extra_float_digits
Previous Message Florian Pflug 2010-05-17 01:23:13 Re: Row-level Locks & SERIALIZABLE transactions, postgres vs. Oracle