Re: logical replication and PANIC during shutdown checkpoint in publisher

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication and PANIC during shutdown checkpoint in publisher
Date: 2017-04-23 01:15:40
Message-ID: 9391d009-3fec-4255-4bbf-ff54de511c5a@2ndquadrant.com
Lists: pgsql-hackers

On 21/04/17 06:11, Michael Paquier wrote:
> On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut
> <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>> On 4/20/17 07:52, Petr Jelinek wrote:
>>> On 20/04/17 05:57, Michael Paquier wrote:
>>>> 2nd thoughts here... Ah now I see your point. True that there is no
>>>> way to ensure that an unwanted command is not running when SIGUSR2 is
>>>> received as the shutdown checkpoint may have already begun. Here is an
>>>> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
>>>> the shutdown checkpoint does not run until all WAL senders still
>>>> running have reached such a state.
>>>
>>> +1 to this solution
>>
>> Michael, can you attempt to supply a patch?
>
> Hmm. I have been actually looking at this solution and I am having
> doubts regarding its robustness. In short this would need to be
> roughly a two-step process:
> - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpointer to
> make it call ShutdownXLOG(). Prior to doing that, a first signal should
> be sent to all the WAL senders with
> SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
> used.
> - At reception of this signal, all WAL senders switch to a stopping
> state, refusing commands that can generate WAL.
> - Checkpointer looks at the state of all WAL senders, looping with a
> sleep call of a couple of ms, refusing to launch the shutdown
> checkpoint until all WAL senders have switched to the
> stopping state.
> - In reaper(), once checkpointer is confirmed as stopped, signal again
> the WAL senders, and tell them to perform the last loop.
>
> After that, I got a second, simpler idea.
> CheckpointerShmem->ckpt_flags holds the information about checkpoints
> currently running, so we could have the WAL senders look at this data
> and prevent any commands generating WAL. The checkpointer may be
> already stopped at the moment the WAL senders finish their loop, so we
> also need to check if the checkpointer is running or not on those code
> paths. Such safeguards may actually be enough for the problem of this
> thread. Thoughts?
>

Hmm, but how do we handle statements that are already in progress by the
time ckpt_flags changes?
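
To make the concern concrete, here is a toy model in Python. It is not
PostgreSQL code, and every name in it (ToyServer, begin_command,
finish_command, run_race) is hypothetical. It only assumes that the
safeguard is a flag check done when a command starts, and shows that a
command which passed the check earlier can still write after the
shutdown checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Toy model only; none of these names exist in PostgreSQL."""
    shutdown_flag: bool = False        # stands in for a bit in ckpt_flags
    shutdown_ckpt_done: bool = False   # set once the shutdown checkpoint ran
    wal: list = field(default_factory=list)

    def begin_command(self, name):
        # Assumed safeguard: refuse new WAL-generating commands once the
        # shutdown flag is visible.
        if self.shutdown_flag:
            raise RuntimeError(f"{name}: refused, shutdown in progress")
        return name  # handle for a command that is now in progress

    def finish_command(self, handle):
        # The command passed the check earlier; nothing rechecks the flag.
        self.wal.append(handle)
        # True means a WAL record was written after the shutdown checkpoint.
        return self.shutdown_ckpt_done

    def shutdown_checkpoint(self):
        self.shutdown_flag = True
        self.shutdown_ckpt_done = True


def run_race():
    server = ToyServer()
    handle = server.begin_command("cmd-A")  # passes, flag not set yet
    server.shutdown_checkpoint()            # flag flips while cmd-A is in flight
    late_write = server.finish_command(handle)
    try:
        server.begin_command("cmd-B")       # a new command is refused
        refused = False
    except RuntimeError:
        refused = True
    return late_write, refused
```

In this model run_race() returns (True, True): the new command is
refused as intended, but the one already in progress still writes after
the checkpoint. Closing that gap would need either a recheck at write
time or having the checkpoint wait for in-flight commands.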

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
