Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-12-19 11:47:15
Message-ID: CAJpy0uB9SfgFnOekSnysfJTToyWQMGXcwu5803HFREEW_FDT0A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 18, 2023 at 4:22 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Dec 15, 2023 at 11:03 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > Sorry, I missed attaching the patch. PFA v48.
> >
>
> Few comments on v48_0002
> ========================

Thanks for reviewing. These are addressed in v50. Please find my
responses inline to some of them.

> 1.
> +static void
> +slotsync_reread_config(WalReceiverConn *wrconn)
> {
> ...
> + pfree(old_primary_conninfo);
> + pfree(old_primary_slotname);
> +
> + if (restart)
> + {
> + char *msg = "slot sync worker will restart because of a parameter change";
> +
> + /*
> + * The exit code 1 will make postmaster restart the slot sync worker.
> + */
> + slotsync_worker_exit(msg, 1 /* proc_exit code */ );
> + }
> ...
>
> I don't see the need to explicitly pfree in case we are already
> exiting the process because anyway the memory will be released. We can
> avoid using the 'restart' variable for this.

I have moved the pfree calls to the end, on the path where we do not
exit the worker, and removed the 'restart' variable.
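
For illustration, the reworked flow is roughly as below (a sketch only;
names follow the patch but the actual v50 code may differ slightly). On a
parameter change we exit directly, and pfree only on the no-restart path,
since process exit releases the memory anyway.

static void
slotsync_reread_config(void)
{
	char	   *old_primary_conninfo = pstrdup(PrimaryConnInfo);
	char	   *old_primary_slotname = pstrdup(PrimarySlotName);

	ConfigReloadPending = false;
	ProcessConfigFile(PGC_SIGHUP);

	if (strcmp(old_primary_conninfo, PrimaryConnInfo) != 0 ||
		strcmp(old_primary_slotname, PrimarySlotName) != 0)
	{
		ereport(LOG,
				(errmsg("slot sync worker will restart because of a parameter change")));

		/* Exit code 1 makes the postmaster restart the slot sync worker. */
		proc_exit(1);
	}

	pfree(old_primary_conninfo);
	pfree(old_primary_slotname);
}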

>Also, probably, directly
> exiting here makes sense and at another place where this function is
> used. I see that in maybe_reread_subscription(), we exit with a 0 code
> and still apply worker restarts, so why use a different exit code
> here?
>

The logical rep worker is started by the logical rep launcher, which
has its own logic for restarting it. OTOH, the slot-sync worker is
started by the postmaster, and the postmaster restarts one of its
bgworkers only if the worker had an abnormal exit and a restart_time
was given when the worker was registered. Thus we need a non-zero
exit code here. I have removed the newly added function, though.
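
Just to illustrate the bgworker restart semantics I am relying on (a
generic example, not the actual registration code in the patch; the
function and worker names here are placeholders): the postmaster restarts
a registered bgworker only when it exits with a non-zero status and
bgw_restart_time is set, while exiting with 0 unregisters it.

	BackgroundWorker bgw;

	memset(&bgw, 0, sizeof(bgw));
	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
	bgw.bgw_start_time = BgWorkerStart_ConsistentState;
	bgw.bgw_restart_time = 10;	/* restart after 10s on non-zero exit */
	snprintf(bgw.bgw_library_name, sizeof(bgw.bgw_library_name), "postgres");
	snprintf(bgw.bgw_function_name, sizeof(bgw.bgw_function_name),
			 "ReplSlotSyncWorkerMain");
	snprintf(bgw.bgw_name, sizeof(bgw.bgw_name), "slot sync worker");
	RegisterBackgroundWorker(&bgw);

	/*
	 * Later, inside the worker: proc_exit(0) means "done, do not restart",
	 * whereas proc_exit(1) lets the postmaster restart it after
	 * bgw_restart_time.
	 */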

> 2.
> +static void
> +check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby)
> {
> ...
> + remote_in_recovery = DatumGetBool(slot_getattr(tupslot, 1, &isnull));
> + Assert(!isnull);
> +
> + /* No need to check further, return that we are cascading standby */
> + if (remote_in_recovery)
> + {
> + *am_cascading_standby = true;
> + CommitTransactionCommand();
> + return;
> ...
> }
>
> Don't we need to clear the result and tuple in case of early return?

Yes, it was needed. Modified.
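
The early-return path now looks roughly like this (sketch based on the
quoted hunk; the exact v50 code may differ):

	if (remote_in_recovery)
	{
		*am_cascading_standby = true;

		/* Clean up the tuple slot and the walreceiver result before returning. */
		ExecClearTuple(tupslot);
		walrcv_clear_result(res);

		CommitTransactionCommand();
		return;
	}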

>
> 3. It would be a good idea to mention about requirements like a
> physical slot on primary, hot_standby_feedback, etc. in the commit
> message.
>
> 4.
> +static bool
> +wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
> + bool *wait_attempts_exceeded)
> {
> ...
> + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
> + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
> + {
> + ereport(WARNING,
> + (errmsg("slot \"%s\" creation aborted", remote_slot->name),
> + errdetail("This slot was not found on the primary server")));
> ...
> + /*
> + * It is possible to get null values for LSN and Xmin if slot is
> + * invalidated on the primary server, so handle accordingly.
> + */
> + new_invalidated = DatumGetBool(slot_getattr(tupslot, 1, &isnull));
> + Assert(!isnull);
> +
> + new_restart_lsn = DatumGetLSN(slot_getattr(tupslot, 2, &isnull));
> + if (new_invalidated || isnull)
> + {
> + ereport(WARNING,
> + (errmsg("slot \"%s\" creation aborted", remote_slot->name),
> + errdetail("This slot was invalidated on the primary server")));
> ...
> }
>
> a. The errdetail message should end with a full stop. Please check all
> other errdetail messages in the patch to follow the same guideline.
> b. I think saying slot creation aborted is not completely correct
> because we would have created the slot especially when it is in 'i'
> state. Can we change it to something like: "aborting initial sync for
> slot \"%s\""?
> c. Also, if the remote_slot is invalidated, ideally, we can even drop
> the local slot but it seems that the patch will drop the same before
> the next-sync cycle with any other slot that needs to be dropped. If
> so, can we add the comment to indicate the same?

I have added comments. Basically, the slot will be dropped in the
caller only if it is still in 'RS_EPHEMERAL' state; if it is already
persisted, it will be kept as is but marked as invalidated in the
caller, and its sync will be skipped from the next cycle onwards.
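
In other words, the caller-side handling is roughly as below (an
illustrative sketch, not the literal v50 code):

	if (MyReplicationSlot->data.persistency == RS_EPHEMERAL)
	{
		/* Releasing an ephemeral slot drops it. */
		ReplicationSlotRelease();
	}
	else
	{
		/*
		 * Keep the persisted slot, but remember the invalidation so that
		 * subsequent sync cycles skip it.
		 */
		SpinLockAcquire(&MyReplicationSlot->mutex);
		MyReplicationSlot->data.invalidated = remote_slot->invalidated;
		SpinLockRelease(&MyReplicationSlot->mutex);
		ReplicationSlotRelease();
	}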

>
> 5.
> +static void
> +local_slot_update(RemoteSlot *remote_slot)
> +{
> + Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE);
> +
> + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn);
> + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn,
> + remote_slot->catalog_xmin);
> + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn,
> + remote_slot->restart_lsn);
> +
> + SpinLockAcquire(&MyReplicationSlot->mutex);
> + MyReplicationSlot->data.invalidated = remote_slot->invalidated;
> + SpinLockRelease(&MyReplicationSlot->mutex);
> ...
> ...
>
> If required, the invalidated flag is updated in the caller as well, so
> why do we need to update it here as well?
>

It was needed for the case where the slot does not yet exist and we
need to create a new one. I have now moved the invalidation check to
the caller; we do not create the slot at all if the remote_slot is
found to be invalidated at the very beginning. And if it gets
invalidated in the middle of the wait logic, it will be dropped by
ReplicationSlotRelease.
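
For reference, the caller-side check now conceptually looks like this
(approximate; field and message wording as in the patch, not necessarily
verbatim):

	/* Do not even create a local slot if the remote slot is already invalidated. */
	if (remote_slot->invalidated != RS_INVAL_NONE)
	{
		ereport(LOG,
				(errmsg("skipping sync of slot \"%s\" as it is invalidated on the primary server",
						remote_slot->name)));
		return;
	}

	/*
	 * If the remote slot gets invalidated while we are waiting for it to
	 * catch up, the local slot is still RS_EPHEMERAL and
	 * ReplicationSlotRelease() will drop it.
	 */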

> --
> With Regards,
> Amit Kapila.
