Re: logical replication and PANIC during shutdown checkpoint in publisher

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication and PANIC during shutdown checkpoint in publisher
Date: 2017-04-24 02:40:55
Message-ID: CAB7nPqTHQJ7zODLCLmhg9oz03qow6eh27NhEqzT+CT64eku5Ug@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 23, 2017 at 10:15 AM, Petr Jelinek
<petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
> On 21/04/17 06:11, Michael Paquier wrote:
>> On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut
>> <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
>> Hmm. I have been actually looking at this solution and I am having
>> doubts regarding its robustness. In short this would need to be
>> roughly a two-step process:
>> - In PostmasterStateMachine(), SIGUSR2 is sent to the checkpointer to
>> make it call ShutdownXLOG(). Prior to doing that, a first signal should
>> be sent to all the WAL senders with
>> SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
>> used.
>> - At reception of this signal, all WAL senders switch to a stopping
>> state, refusing commands that can generate WAL.
>> - Checkpointer looks at the state of all WAL senders, looping with a
>> sleep call of a couple of ms, refusing to launch the shutdown
>> checkpoint as long as all WAL senders have not switched to the
>> stopping state.
>> - In reaper(), once checkpointer is confirmed as stopped, signal again
>> the WAL senders, and tell them to perform the last loop.

OK, I have been hacking on that and ended up with the attached patch.
It uses SIGUSR2 to instruct the WAL senders to prepare for stopping,
and SIGINT to trigger the last WAL flush loop. The shutdown checkpoint
moves on only once all active WAL senders are marked with a STOPPING
state. Reviews are welcome.
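
To make the checkpointer-side condition concrete, here is a minimal
standalone C sketch. It is not taken from the attached patch: the
Sketch* types and sketch_* functions are hypothetical stand-ins for the
WAL sender shared-memory state, and signal delivery is reduced to a
plain function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model of a WAL sender slot. These names are
 * illustrative only and are not the PostgreSQL definitions.
 */
typedef enum
{
    SKETCH_WALSND_INACTIVE,     /* slot not used by any process */
    SKETCH_WALSND_STREAMING,    /* active, may still run commands */
    SKETCH_WALSND_STOPPING      /* active, refuses WAL-generating commands */
} SketchWalSndState;

typedef struct
{
    SketchWalSndState state;
} SketchWalSnd;

/*
 * What a WAL sender would do on the first signal (SIGUSR2 in the
 * description above): switch to the stopping state if it is active.
 */
void
sketch_prepare_stop(SketchWalSnd *snd)
{
    assert(snd != NULL);
    if (snd->state != SKETCH_WALSND_INACTIVE)
        snd->state = SKETCH_WALSND_STOPPING;
}

/*
 * Condition checked before the shutdown checkpoint: every active WAL
 * sender must be in the stopping state. Inactive slots are ignored.
 */
bool
sketch_all_walsenders_stopping(const SketchWalSnd *snds, size_t n)
{
    assert(snds != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (snds[i].state != SKETCH_WALSND_INACTIVE &&
            snds[i].state != SKETCH_WALSND_STOPPING)
            return false;
    }
    return true;
}
```

In a real checkpointer this predicate would be polled in a loop with a
short sleep, as in the quoted steps, and the shutdown checkpoint would
start only once it returns true.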

>> After that, I got a second, simpler idea.
>> CheckpointerShmem->ckpt_flags holds the information about checkpoints
>> currently running, so we could have the WAL senders look at this data
>> and prevent any commands generating WAL. The checkpointer may be
>> already stopped at the moment the WAL senders finish their loop, so we
>> need also to check if the checkpointer is running or not on those code
>> paths. Such safeguards may actually be enough for the problem of this
>> thread. Thoughts?
>>
>
> Hmm but how do we handle statements that are already in progress by the
> time ckpt_flags changes?

Yup, this does not handle race conditions well.
--
Michael

Attachment Content-Type Size
walsender-chkpt-v1.patch application/octet-stream 12.7 KB
