Re: [PoC] pg_upgrade: allow to upgrade publisher node

From: vignesh C <vignesh21(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-07-28 11:59:06
Message-ID: CALDaNm1TLfZZ5aUnhP477KTdK7xgzwZJ2m0b=73nJNzQNM8SdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 21 Jul 2023 at 13:00, Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear hackers,
>
> > Based on the above, we are considering that we delay the timing of shutdown for
> > logical walsenders. The preliminary workflow is:
> >
> > 1. When logical walsenders receives siginal from checkpointer, it consumes all
> > of WAL records, change its state into WALSNDSTATE_STOPPING, and stop
> > doing
> > anything.
> > 2. Then the checkpointer does the shutdown checkpoint
> > 3. After that postmaster sends signal to walsenders, same as current
> > implementation.
> > 4. Finally logical walsenders process the shutdown checkpoint record and update
> > the
> > confirmed_lsn after the acknowledgement from subscriber.
> > Note that logical walsenders don't have to send a shutdown checkpoint record
> > to subscriber but following keep_alive will help us to increment the
> > confirmed_lsn.
> > 5. All tasks are done, they exit.
> >
> > This mechanism ensures that the confirmed_lsn of active slots is same as the
> > current
> > WAL location of old publisher, so that 0003 patch would become more simpler.
> > We would not have to calculate the acceptable difference anymore.
> >
> > One thing we must consider is that any WALs must not be generated while
> > decoding
> > the shutdown checkpoint record. It causes the PANIC. IIUC the record leads
> > SnapBuildSerializationPoint(), which just serializes snapbuild or restores from
> > it, so the change may be acceptable. Thought?
>
> I've implemented the ideas from my previous proposal, PSA another patch set.
> Patch 0001 introduces the state WALSNDSTATE_STOPPING to logical walsenders. The
> workflow remains largely the same as described in my previous post, with the
> following additions:
>
> * A flag has been added to track whether all the WALs have been flushed. The
> logical walsender can only exit after the flag is set. This ensures that all
> WALs are flushed before the termination of the walsender.
> * Cumulative statistics are now forcibly written before changing the state.
> While the previous involved reporting stats upon process exit, the current approach
> must report earlier due to the checkpointer's termination timing. See comments
> in CheckpointerMain() and atop pgstat_before_server_shutdown().
> * At the end of processes, slots are now saved to disk.
>
>
> Patch 0002 adds --include-logical-replication-slots option to pg_upgrade,
> not changed from previous set.
>
> Patch 0003 adds a check function, which becomes simpler.
> The previous version calculated the "acceptable" difference between confirmed_lsn
> and the current WAL position. This was necessary because shutdown records could
> not be sent to subscribers, creating a disparity in these values. However, this
> approach had drawbacks, such as needing adjustments if record sizes changed.
>
> Now, the record can be sent to subscribers, so the hacking is not needed anymore,
> at least in the context of logical replication. The consistency is now maintained
> by the logical walsenders, so slots created by the backend could not be.
> We must consider what should be...
>
> How do you think?

Here is a patch which checks that there are no WAL records other than
CHECKPOINT_SHUTDOWN WAL record to be consumed based on the discussion
from [1].
Patch 0001 and 0002 is same as the patch posted by Kuroda-san, Patch
0003 exposes pg_get_wal_records_content to get the WAL records along
with the WAL record type between start and end lsn. pg_walinspect
contrib module already exposes a function for this requirement, I have
moved this functionality to be exposed from the backend. Patch 0004
has slight change in check function to check that there are no other
records other than CHECKPOINT_SHUTDOWN to be consumed. The attached
patch has the changes for the same.
Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1Kem-J5NM7GJCgyKP84pEN6RsG6JWo%3D6pSn1E%2BiexL1Fw%40mail.gmail.com

Regards,
Vignesh

Attachment Content-Type Size
0004-pg_upgrade-Add-check-function-for-include-logical-re.patch text/x-patch 11.9 KB
0002-pg_upgrade-Add-include-logical-replication-slots-opt.patch text/x-patch 34.0 KB
0001-Send-shutdown-checkpoint-record-to-subscriber.patch text/x-patch 3.0 KB
0003-Move-pg_get_wal_records_info-functionality-from-pg_w.patch text/x-patch 31.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2023-07-28 12:35:54 Re: logical decoding and replication of sequences, take 2
Previous Message Anthonin Bonnefoy 2023-07-28 11:45:24 Re: POC: Extension for adding distributed tracing - pg_tracing