RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-03-01 02:17:26
Message-ID: OS0PR01MB5716E281A55E5053DF4976D1945E2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thursday, February 29, 2024 7:17 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Feb 29, 2024 at 3:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >
> > Few additional comments on the latest patch:
> > =================================
> > 1.
> > static XLogRecPtr
> > WalSndWaitForWal(XLogRecPtr loc)
> > {
> > ...
> > + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr &&
> > + (!replication_active || StandbyConfirmedFlush(loc, WARNING))) {
> > + /*
> > + * Fast path to avoid acquiring the spinlock in case we already know
> > + * we have enough WAL available and all the standby servers have
> > + * confirmed receipt of WAL up to RecentFlushPtr. This is
> > + particularly
> > + * interesting if we're far behind.
> > + */
> > return RecentFlushPtr;
> > + }
> > ...
> > ...
> > + * Wait for WALs to be flushed to disk only if the postmaster has not
> > + * asked us to stop.
> > + */
> > + if (loc > RecentFlushPtr && !got_STOPPING) wait_event =
> > + WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
> > +
> > + /*
> > + * Check if the standby slots have caught up to the flushed position.
> > + * It is good to wait up to RecentFlushPtr and then let it send the
> > + * changes to logical subscribers one by one which are already
> > + covered
> > + * in RecentFlushPtr without needing to wait on every change for
> > + * standby confirmation. Note that after receiving the shutdown
> > + signal,
> > + * an ERROR is reported if any slots are dropped, invalidated, or
> > + * inactive. This measure is taken to prevent the walsender from
> > + * waiting indefinitely.
> > + */
> > + else if (replication_active &&
> > + !StandbyConfirmedFlush(RecentFlushPtr, WARNING)) { wait_event =
> > + WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
> > + wait_for_standby = true;
> > + }
> > + else
> > + {
> > + /* Already caught up and doesn't need to wait for standby_slots. */
> > break;
> > + }
> > ...
> > }
> >
> > Can we try to move these checks into a separate function that returns
> > a boolean and has an out parameter as wait_event?
> >
> > 2. How about naming StandbyConfirmedFlush() as
> StandbySlotsAreCaughtup?
> >
>
> Some more comments:
> ==================
> 1.
> + foreach_ptr(char, name, elemlist)
> + {
> + ReplicationSlot *slot;
> +
> + slot = SearchNamedReplicationSlot(name, true);
> +
> + if (!slot)
> + {
> + GUC_check_errdetail("replication slot \"%s\" does not exist", name);
> + ok = false; break; }
> +
> + if (!SlotIsPhysical(slot))
> + {
> + GUC_check_errdetail("\"%s\" is not a physical replication slot",
> + name); ok = false; break; } }
>
> Won't the second check need protection via ReplicationSlotControlLock?

Yes, added.

>
> 2. In WaitForStandbyConfirmation(), we are anyway calling
> StandbyConfirmedFlush() before the actual sleep in the loop, so does calling it
> at the beginning of the function will serve any purpose? If so, it is better to add
> some comments explaining the same.

It is used as a fast-path to avoid calling condition variable stuff, I think we can directly
put failover check and list check in the beginning instead of calling that function.

>
> 3. Also do we need to perform the validation checks done in
> StandbyConfirmedFlush() repeatedly when it is invoked in a loop? We can
> probably separate those checks and perform them just once.

I have removed slot.failover check from the StandbyConfirmedFlush
function, so that we can do it when necessary. I didn’t change the check
for standby_slot_names_list because the list could be changed in the loop
when reloading config.

>
> 4.
> +   *
> +   * XXX: If needed, this can be attempted in future.
>
> Remove this part of the comment.

Removed.

Attach the V102 patch set which addressed Amit and Shveta's comments.
Thanks Shveta for helping addressing the comments off-list.

Best Regards,
Hou zj

Attachment Content-Type Size
v102-0002-Document-the-steps-to-check-if-the-standby-is-r.patch application/octet-stream 7.0 KB
v102-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch application/octet-stream 41.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhijie Hou (Fujitsu) 2024-03-01 02:18:18 RE: Synchronizing slots from primary to standby
Previous Message Michael Paquier 2024-03-01 02:05:27 Re: Propagate sanity checks of ProcessUtility() to standard_ProcessUtility()?