Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-11-21 05:45:31
Message-ID: CAJpy0uDj_97bhm=rQsN4ErhHLZzgNCrhdLYaVdFX8ZeCoaGJzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 21, 2023 at 10:01 AM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Saturday, November 18, 2023 6:46 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand
> > <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > >
> > > On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote:
> > > > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand
> > <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> > > >
> > > > I feel the WaitForWALToBecomeAvailable may not be the best place to
> > > > shutdown slotsync worker and drop slots. There could be other
> > > > reasons(other than
> > > > promotion) as mentioned in comments in case XLOG_FROM_STREAM to
> > > > reach the code there. I thought if the intention is to stop slotsync
> > > > workers on promotion, maybe FinishWalRecovery() is a better place to
> > > > do it as it's indicating the end of recovery and XLogShutdownWalRcv is also
> > called in it.
> > >
> > > I can see that slotsync_drop_initiated_slots() has been moved in
> > > FinishWalRecovery() in v35. That looks ok.
> > > >
> >
> >
> > More Review for v35-0002*
>
> Thanks for the comments.
>
> > ============================
> > 1.
> > + ereport(WARNING,
> > + errmsg("skipping slots synchronization as primary_slot_name "
> > + "is not set."));
> >
> > There is no need to use a full stop at the end for WARNING messages and as
> > previously mentioned, let's not split message lines in such cases. There are
> > other messages in the patch with similar problems, please fix those as well.
>
> Adjusted.
>
> >
> > 2.
> > +slotsync_checks()
> > {
> > ...
> > ...
> > + /* The hot_standby_feedback must be ON for slot-sync to work */ if
> > + (!hot_standby_feedback) { ereport(WARNING, errmsg("skipping slots
> > + synchronization as hot_standby_feedback "
> > + "is off."));
> >
> > This message has the same problem as mentioned in the previous comment.
> > Additionally, I think either atop slotsync_checks or along with GUC check we
> > should write comments as to why we expect these values to be set for slot sync
> > to work.
>
> Added comments for these cases.
>
> >
> > 3.
> > + /* The worker is running already */
> > + if (SlotSyncWorker &&SlotSyncWorker->hdr.in_use &&
> > + SlotSyncWorker->hdr.proc)
> >
> > The spacing for both the &&'s has problems. You need a space after the first
> > && and the second && should be in the prior line.
>
> Adjusted.
>
> >
> > 4.
> > + LauncherRereadConfig(&recheck_slotsync);
> > +
> > }
> >
> > An empty line after LauncherRereadConfig() is not required.
> >
> > 5.
> > +static void
> > +LauncherRereadConfig(bool *ss_recheck)
> > +{
> > + char *conninfo = pstrdup(PrimaryConnInfo);
> > + char *slotname = pstrdup(PrimarySlotName);
> > + bool syncslot = enable_syncslot;
> > + bool feedback = hot_standby_feedback;
> >
> > Can we change the variable name 'feedback' to 'standbyfeedback' to make it
> > slightly more descriptive?
>
> Changed.
>
> >
> > 6. The logic to recheck the slot_sync related parameters in
> > LauncherMain() is not very clear. IIUC, if after reload config any parameter is
> > changed, we just seem to be checking the validity of the changed parameter
> > but not restarting the slot sync worker, is that correct? If so, what if dbname is
> > changed, don't we need to restart the slot-sync worker and re-initialize the
> > connection; similarly slotname change also needs some thoughts. Also, if all the
> > parameters are valid we seem to be re-launching the slot-sync worker without
> > first stopping it which doesn't seem correct, am I missing something in this
> > logic?
>
> I think the slot sync worker will be stopped in LauncherRereadConfig() if GUC changed
> and new slot sync worker will be started in next loop in LauncherMain().

yes, LauncherRereadConfig will stop the worker on any parameter change
and will set recheck_slotsync(). On finding this flag as true,
LauncherMain will redo all the validations and restart slot-sync
worker if needed. Yes, we do need to stop and relaunch slotsync
workers on dbname change as well. This is currently missing
inLauncherRereadConfig (). Regarding slot name change,we are already
doing it, we are already checking PrimarySlotName in
LauncherRereadConfig()

>
>
> > 7.
> > @@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn,
> > errmsg("replication slot \"%s\" was not created in this database",
> > NameStr(slot->data.name))));
> >
> > + in_recovery = RecoveryInProgress();
> > +
> > + /*
> > + * Do not allow consumption of a "synchronized" slot until the standby
> > + * gets promoted. Also do not allow consumption of slots with
> > + sync_state
> > + * as SYNCSLOT_STATE_INITIATED as they are not synced completely to be
> > + * used.
> > + */
> > + if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) ||
> > + slot->data.sync_state == SYNCSLOT_STATE_INITIATED)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > + errmsg("cannot use replication slot \"%s\" for logical decoding",
> > + NameStr(slot->data.name)), in_recovery ?
> > + errdetail("This slot is being synced from the primary server.") :
> > + errdetail("This slot was not synced completely from the primary
> > + server."), errhint("Specify another replication slot.")));
> > +
> >
> > If we are planning to drop slots in state SYNCSLOT_STATE_INITIATED at the
> > time of promotion, don't we need to just have an assert or elog(ERROR, .. for
> > non-recovery cases as such cases won't be reachable? If so, I think we can
> > separate out that case here.
>
> Adjusted the codes as suggested.
>
> >
> > 8.
> > wait_for_primary_slot_catchup()
> > {
> > ...
> > + /* Check if this standby is promoted while we are waiting */ if
> > + (!RecoveryInProgress()) {
> > + /*
> > + * The remote slot didn't pass the locally reserved position at
> > + * the time of local promotion, so it's not safe to use.
> > + */
> > + ereport(
> > + WARNING,
> > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > + errmsg(
> > + "slot-sync wait for slot %s interrupted by promotion, "
> > + "slot creation aborted", remote_slot->name))); pfree(cmd.data); return
> > + false; }
> > ...
> > }
> >
> > Shouldn't this be an Assert because a slot-sync worker shouldn't exist for
> > non-standby servers?
>
> Changed to Assert.
>
> >
> > 9.
> > wait_for_primary_slot_catchup()
> > {
> > ...
> > + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
> > + if (!tuplestore_gettupleslot(res->tuplestore, true, false, slot)) {
> > + ereport(WARNING, (errmsg("slot \"%s\" disappeared from the primary
> > + server,"
> > + " slot creation aborted", remote_slot->name))); pfree(cmd.data);
> > + walrcv_clear_result(res); return false;
> >
> > If the slot on primary disappears, shouldn't this part of the code somehow
> > ensure to remove the slot on standby as well? If it is taken at some other point
> > in time then at least we should write a comment here to state how it is taken
> > care of. I think this comment also applies to a few other checks following this
> > check.
>
> I adjusted the code here to not persist the slots if the slot disappeared or invalidated
> on primary, so that the local slot will get dropped when releasing.
>
> >
> > 10.
> > + /*
> > + * It is possible to get null values for lsns and xmin if slot is
> > + * invalidated on the primary server, so handle accordingly.
> > + */
> > + new_invalidated = DatumGetBool(slot_getattr(slot, 1, &isnull));
> >
> > We can say LSN and Xmin in the above comment to make it easier to
> > read/understand.
>
> Changed.
>
> >
> > 11.
> > /*
> > + * Once we got valid restart_lsn, then confirmed_lsn and catalog_xmin
> > + * are expected to be valid/non-null, so assert if found null.
> > + */
> >
> > No need to explicitly say about assert, it is clear from the code. We can slightly
> > change this comment to: "Once we got valid restart_lsn, then confirmed_lsn
> > and catalog_xmin are expected to be valid/non-null."
>
> Changed.
>
> >
> > 12.
> > + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn ||
> > + TransactionIdPrecedes(remote_slot->catalog_xmin,
> > + MyReplicationSlot->data.catalog_xmin))
> > + {
> > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot)) {
> > + /*
> > + * The remote slot didn't catch up to locally reserved position.
> > + * But still persist it and attempt the wait and sync in next
> > + * sync-cycle.
> > + */
> > + if (MyReplicationSlot->data.persistency != RS_PERSISTENT) {
> > + ReplicationSlotPersist(); *slot_updated = true; }
> >
> > I think the reason to persist in this case is because next time local restart_lsn can
> > be ahead than the current location and it can take more time to create such a
> > slot. We can probably mention the same in the comments.
>
> Updated the comments.
>
> Here is the V37 patch set which addressed comments above and [1].
>
> [1] https://www.postgresql.org/message-id/CAA4eK1%2BP9R3GO2rwGBg2EOh%3DuYjWUSEOHD8yvs4Je8WYa2RHag%40mail.gmail.com
>
> Best Regards,
> Hou zj

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2023-11-21 05:52:32 Re: Simplify if/else logic of walsender CreateReplicationSlot
Previous Message Peter Smith 2023-11-21 05:39:21 Re: Synchronizing slots from primary to standby