Re: [PoC] pg_upgrade: allow to upgrade publisher node

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-07-18 09:06:51
Message-ID: CAA4eK1L6fmTAGS3pY1YHGHhreg424wH6QwYbxqyV_7OF2AXGjw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 17, 2023 at 6:19 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jun 30, 2023 at 7:29 PM Hayato Kuroda (Fujitsu)
> <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> >
> > I have analyzed this further and concluded that there is no difference between
> > a manual and a shutdown checkpoint.
> >
> > The difference was whether the CHECKPOINT record has been decoded or not.
> > The overall workflow of this test was:
> >
> > 1. do INSERT
> > (2. do CHECKPOINT)
> > (3. decode CHECKPOINT record)
> > 4. receive feedback message from standby
> > 5. do shutdown CHECKPOINT
> >
> > At step 3, the walsender decoded that WAL and set candidate_xmin_lsn. The stack trace was:
> > standby_decode()->SnapBuildProcessRunningXacts()->LogicalIncreaseXminForSlot().
> >
> > At step 4, the confirmed_flush of the slot was updated, but ReplicationSlotSave()
> > was executed only when slot->candidate_xmin_lsn had a valid LSN. If steps 2 and
> > 3 are missed, the dirty flag is not set and the change remains only in memory.
> >
> > Finally, the CHECKPOINT was executed at step 5. If steps 2 and 3 are missed and
> > the patch from Julien is not applied, the updated value will be discarded. This
> > is what I observed. The patch forces the logical slot to be saved at the shutdown
> > checkpoint, so the confirmed_flush is saved to disk at step 5.
> >
>
> I see your point but there are comments in walsender.c which indicate
> that we also wait for step-5 to get replicated. See [1] and comments
> atop walsender.c. If this is true then we don't need a special check
> as you have in patch 0003 or at least it doesn't seem to be required
> in all cases.
>

I have studied this a bit more and it seems that this is true for
physical walsenders: we set the walsender state to WALSNDSTATE_STOPPING
in XLogSendPhysical, then the checkpointer finishes writing the
checkpoint record, and then the postmaster sends SIGUSR2 for the
walsender to exit. IIUC, this whole logic of different stop states was
introduced in commit c6c3334364 based on the discussion in the thread
[1]. As per my understanding, logical walsenders don't seem to wait for
the shutdown checkpoint record and finish even before we LOG that
record. It seems that the behavior of logical walsenders is different
from that of physical walsenders, where we wait for them to send even
the final shutdown checkpoint record before they finish. If so, then we
won't be able to switch over to logical subscribers even in case of a
clean shutdown. Am I missing something?

[1] - https://www.postgresql.org/message-id/CAHGQGwEsttg9P9LOOavoc9d6VB1zVmYgfBk%3DLjsk-UL9cEf-eA%40mail.gmail.com

--
With Regards,
Amit Kapila.
