Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-12-22 10:32:21
Message-ID: CAJpy0uBY1x_mjqUk6dyD3iGtihwboy5mnrnL4tzZxTD3vy7X4A@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 22, 2023 at 3:11 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Thursday, December 21, 2023 5:39 PM Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >
> > On Thu, Dec 21, 2023 at 02:23:12AM +0000, Zhijie Hou (Fujitsu) wrote:
> > > On Wednesday, December 20, 2023 8:42 PM Zhijie Hou (Fujitsu)
> > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > >
> > > > Attach the V51 patch set which addressed Kuroda-san's comments.
> > > > I also tried to improve the test in 0003 to make it stable.
> > >
> > > The patches conflict with a recent commit dc21234.
> > > Here is the rebased V51_2 version, there is no code changes in this version.
> > >
> >
> > Thanks!
> >
> > I've a few remarks regarding 0001:
>
> Thanks for the comments!
>
> >
> > 1 ===
> >
> > In the commit message what about replacing "Allow logical walsenders to wait
> > for the physical standbys" with "Force some logical walsenders to wait for the
> > physical standbys"?
>
> I feel 'Allow' is OK, as the GUC standby_slot_names is optional for user. ISTM, 'force'
> means we always wait for physical standbys regardless of the GUC.
>
> >
> > Also I think it would be better to first explain what we are trying to achieve and
> > after explain how we do it (adding a new flag in CREATE SUBSCRIPTION and so
> > on).
>
> Noted. We are about to split the patches, so will improve each commit message after that.
>
> >
> > 4 ===
> >
> > @@ -248,10 +262,13 @@ ReplicationSlotValidateName(const char *name, int
> > elevel)
> > * during getting changes, if the two_phase option is enabled it can skip
> > * prepare because by that time start decoding point has been moved. So
> > the
> > * user will only get commit prepared.
> > + * failover: If enabled, allows the slot to be synced to physical standbys so
> > + * that logical replication can be resumed after failover.
> >
> > s/allows/forces ?
>
> I think whether the slot is synced also depends on the
> GUC setting on standby, so I feel 'allow' is fine here.
>
> >
> > 5 ===
> >
> > + bool ok;
> >
> > parse_ok maybe?
>
> The flag is also used to store the slot type check result, so I feel 'ok' is
> better here.
>
> >
> > 6 ===
> >
> > + /* Need a modifiable copy of string. */
> > + rawname = pstrdup(*newval);
> >
> > It seems to me that the single line comments in the neighborhood functions
> > (see
> > RestoreSlotFromDisk() for example) don't finish with ".". Worth to follow the
> > same format for all what we add in slot.c?
>
> I felt we have both styles in slot.c, but it seems Kuroda-san also
> prefer removing the ".", so will address.
>
> >
> > 7 ===
> >
> > +static void
> > +parseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover)
> >
> > ParseAlterReplSlotOptions instead?
>
> I think it followed parseCreateReplSlotOptions, but I agree that it looks
> inconsistent with other names. Will address.
>
> > 11 ===
> >
> > + * When the wait event is WAIT_FOR_STANDBY_CONFIRMATION, wait on
> > another
> > + * CV that is woken up by physical walsenders when the walreceiver has
> > + * confirmed the receipt of LSN.
> >
> > s/that is woken up by/that is broadcasted by/ ?
>
> Will reword the comment here.
>
> >
> > 12 ===
> >
> > We are mentioning in several places that the replication can be resumed after a
> > failover. Should we add a few words about possible lag? (see [1])
> >
> > [1]:
> > https://www.postgresql.org/message-id/CAA4eK1KihniOK21mEVYtSOHRQiG
> > NyToUmENWp7hPbH_PMsqzkA%40mail.gmail.com
>
> It feels like the implementation detail to me, but noted. We will think more
> about the document.
>
>
> The comments not mentioned above look good to me.
>
> Best Regards,
> Hou zj

Please find attached v53. The changes are:

patch001:
1) Addressed comments in [1] for v51-001. Thanks Hou-san for working on this.

patch002:
2) Addressed comments in [2] for v52-002.
3) Fixed the CFBot failure. It was caused by an assert in
wait_for_primary_slot_catchup() that fired when a null confirmed_lsn
was received. In wait_for_primary_slot_catchup(), we assumed that if
restart_lsn is valid and 'conflicting' is false, then confirmed_lsn
must be non-null. That is not true. confirmed_lsn and catalog_xmin can
both be null if the slot has just been created on the primary server
with a valid restart_lsn and the slot-sync worker fetches the slot
before the primary server sets a valid confirmed_lsn and catalog_xmin.
In pg_create_logical_replication_slot(), there is a small window
between CreateInitDecodingContext-->ReplicationSlotReserveWal(), which
sets restart_lsn, and DecodingContextFindStartpoint(), which sets
confirmed_lsn. If the slot-sync worker fetches the slot in this
window, the confirmed_lsn received will be NULL. Removed the assert
and added a condition that confirmed_lsn must be valid before moving
the slot to 'r'.
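
To illustrate the corrected check in 3), here is a minimal standalone
sketch. It is not the patch code: RemoteSlotSketch and
slot_can_move_to_r_state() are hypothetical names, and the XLogRecPtr
typedef and macros are local stand-ins so the snippet compiles outside
the PostgreSQL tree. It only models the idea that a NULL confirmed_lsn
is now treated as "not yet" instead of tripping an assert.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the PostgreSQL definitions (assumption: sketch only) */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r) ((r) == InvalidXLogRecPtr)

/* Hypothetical view of one slot row fetched from the primary */
typedef struct RemoteSlotSketch
{
    XLogRecPtr  restart_lsn;    /* set by ReplicationSlotReserveWal() */
    XLogRecPtr  confirmed_lsn;  /* invalid models a NULL from the primary */
    bool        conflicting;
} RemoteSlotSketch;

/*
 * Return true only when the fetched slot can be moved to 'r'.
 *
 * A valid restart_lsn with conflicting = false does not imply a valid
 * confirmed_lsn: the slot may have been fetched in the window before
 * DecodingContextFindStartpoint() ran on the primary.
 */
bool
slot_can_move_to_r_state(const RemoteSlotSketch *slot)
{
    if (XLogRecPtrIsInvalid(slot->restart_lsn) || slot->conflicting)
        return false;

    /* Formerly asserted; now an ordinary condition, so the caller retries */
    if (XLogRecPtrIsInvalid(slot->confirmed_lsn))
        return false;

    return true;
}
```

A slot fetched inside the window (valid restart_lsn, invalid
confirmed_lsn) returns false here, so the caller would keep waiting
and check again on a later fetch.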

[1]: https://www.postgresql.org/message-id/ZYQHvgBpH0GgQaJK%40ip-10-97-1-34.eu-west-3.compute.internal
[2]: https://www.postgresql.org/message-id/TY3PR01MB98893274D5A4FD4F86CC04A0F595A%40TY3PR01MB9889.jpnprd01.prod.outlook.com

thanks
Shveta

Attachment Content-Type Size
v53-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch application/octet-stream 147.2 KB
v53-0002-Add-logical-slot-sync-capability-to-the-physical.patch application/octet-stream 91.3 KB
