RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-03-03 07:51:38
Message-ID: OS0PR01MB5716B382E3DA4B79C937BC90945C2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sunday, March 3, 2024 7:47 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Sat, Mar 2, 2024 at 2:51 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> wrote:
> >
> > On Friday, March 1, 2024 12:23 PM Peter Smith <smithpb2250(at)gmail(dot)com>
> wrote:
> > >
> ...
> > > ======
> > > src/backend/replication/slot.c
> > >
> > > 2. validate_standby_slots
> > >
> > > + else if (!ReplicationSlotCtl)
> > > + {
> > > + /*
> > > + * We cannot validate the replication slot if the replication slots'
> > > + * data has not been initialized. This is ok as we will validate
> > > + the
> > > + * specified slot when waiting for them to catch up. See
> > > + * StandbySlotsHaveCaughtup for details.
> > > + */
> > > + }
> > > + else
> > > + {
> > > + /*
> > > + * If the replication slots' data have been initialized, verify if
> > > + the
> > > + * specified slots exist and are logical slots.
> > > + */
> > > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> > >
> > > IMO that 2nd comment does not need to say "If the replication slots'
> > > data have been initialized," because that is implicit from the if/else.
> >
> > Removed.
>
> I only meant to suggest removing the 1st part, not the entire comment.
> I thought it is still useful to have a comment like:
>
> /* Check that the specified slots exist and are logical slots.*/

OK, I misunderstood. Fixed.

>
> ======
>
> And, here are some review comments for v103-0001.

Thanks for the comments.

>
> ======
> Commit message
>
> 1.
> Additionally, The SQL functions pg_logical_slot_get_changes,
> pg_logical_slot_peek_changes and pg_replication_slot_advance are modified
> to wait for the replication slots mentioned in 'standby_slot_names' to catch up
> before returning.
>
> ~
>
> (use the same wording as previous in this message)
>
> /mentioned in/specified in/

Changed.

>
> ======
> doc/src/sgml/config.sgml
>
> 2.
> + Additionally, when using the replication management functions
> + <link linkend="pg-replication-slot-advance">
> + <function>pg_replication_slot_advance</function></link>,
> + <link linkend="pg-logical-slot-get-changes">
> + <function>pg_logical_slot_get_changes</function></link>, and
> + <link linkend="pg-logical-slot-peek-changes">
> + <function>pg_logical_slot_peek_changes</function></link>,
> + with failover enabled logical slots, the functions will wait for the
> + physical slots specified in
> <varname>standby_slot_names</varname> to
> + confirm WAL receipt before proceeding.
> + </para>
>
> It says "the ... functions" twice so maybe reword it slightly.
>
> BEFORE
> Additionally, when using the replication management functions
> pg_replication_slot_advance, pg_logical_slot_get_changes, and
> pg_logical_slot_peek_changes, with failover enabled logical slots, the functions
> will wait for the physical slots specified in standby_slot_names to confirm WAL
> receipt before proceeding.
>
> SUGGESTION
> Additionally, the replication management functions
> pg_replication_slot_advance, pg_logical_slot_get_changes, and
> pg_logical_slot_peek_changes, when used with failover enabled logical slots,
> will wait for the physical slots specified in standby_slot_names to confirm WAL
> receipt before proceeding.
>
> (Actually the "will wait ... before proceeding" is also a bit tricky, so below is
> another possible rewording)
>
> SUGGESTION #2
> Additionally, the replication management functions
> pg_replication_slot_advance, pg_logical_slot_get_changes, and
> pg_logical_slot_peek_changes, when used with failover enabled logical slots,
> will block until all physical slots specified in standby_slot_names have
> confirmed WAL receipt.

I prefer the #2 version.

>
> ~~~
>
> 3.
> + <note>
> + <para>
> + Value <literal>*</literal> is not accepted as it is inappropriate to
> + block logical replication for physical slots that either lack
> + associated standbys or have standbys associated that are not
> enabled
> + for replication slot synchronization. (see
> + <xref
> linkend="logicaldecoding-replication-slots-synchronization"/>).
> + </para>
> + </note>
>
> Why does the document need to provide an excuse/reason for the rule?
> You could just say something like:
>
> SUGGESTION
> The slots must be named explicitly. For example, specifying wildcard values like
> <literal>*</literal> is not permitted.

As suggested by Amit, I moved this to code comments.

>
> ======
> doc/src/sgml/func.sgml
>
> 4.
> @@ -28150,7 +28150,7 @@ postgres=# SELECT '0/0'::pg_lsn +
> pd.segment_number * ps.setting::int + :offset
> </row>
>
> <row>
> - <entry role="func_table_entry"><para role="func_signature">
> + <entry id="pg-logical-slot-get-changes"
> role="func_table_entry"><para role="func_signature">
> <indexterm>
> <primary>pg_logical_slot_get_changes</primary>
> </indexterm>
> @@ -28177,7 +28177,7 @@ postgres=# SELECT '0/0'::pg_lsn +
> pd.segment_number * ps.setting::int + :offset
> </row>
>
> <row>
> - <entry role="func_table_entry"><para role="func_signature">
> + <entry id="pg-logical-slot-peek-changes"
> role="func_table_entry"><para role="func_signature">
> <indexterm>
> <primary>pg_logical_slot_peek_changes</primary>
> </indexterm>
> @@ -28232,7 +28232,7 @@ postgres=# SELECT '0/0'::pg_lsn +
> pd.segment_number * ps.setting::int + :offset
> </row>
>
> <row>
> - <entry role="func_table_entry"><para role="func_signature">
> + <entry id="pg-replication-slot-advance"
> role="func_table_entry"><para role="func_signature">
> <indexterm>
> <primary>pg_replication_slot_advance</primary>
> </indexterm>
>
> Should these 3 functions say something about how their behaviour is affected
> by 'standby_slot_names' and give a link back to the GUC 'standby_slot_names'
> docs?

I added the info for pg_logical_slot_get_changes() and
pg_replication_slot_advance(). The pg_logical_slot_peek_changes() function
clarifies that it behaves just like pg_logical_slot_get_changes(), so I didn’t
touch it.

>
> ======
> src/backend/replication/slot.c
>
> 5. StandbySlotsHaveCaughtup
>
> + if (!slot)
> + {
> + /*
> + * If the provided slot does not exist, report a message and exit
> + * the loop. It is possible for a user to specify a slot name in
> + * standby_slot_names that does not exist just before the server
> + * startup. The GUC check_hook(validate_standby_slots) cannot
> + * validate such a slot during startup as the ReplicationSlotCtl
> + * shared memory is not initialized at that time. It is also
> + * possible for a user to drop the slot in standby_slot_names
> + * afterwards.
> + */
>
> 5a.
> Minor rewording to make this code comment more similar to the next one:
>
> SUGGESTION
> If a slot name provided in standby_slot_names does not exist, report a message
> and exit the loop. A user can specify a slot name that does not exist just before
> the server startup...

Changed.

>
> ~
>
> 5b.
> + /*
> + * If a logical slot name is provided in standby_slot_names,
> + * report a message and exit the loop. Similar to the non-existent
> + * case, it is possible for a user to specify a logical slot name
> + * in standby_slot_names before the server startup, or drop an
> + * existing physical slot and recreate a logical slot with the
> + * same name.
> + */
>
> /it is possible for a user to specify/a user can specify/

Changed.

>
> ~~~
>
> 6. WaitForStandbyConfirmation
>
> + /*
> + * We wait for the slots in the standby_slot_names to catch up, but we
> + * use a timeout (1s) so we can also check if the standby_slot_names
> + * has been changed.
> + */
>
> Remove some of the "we".
>
> SUGGESTION
> Wait for the slots in the standby_slot_names to catch up, but use a timeout (1s)
> so we can also check if the standby_slot_names has been changed.

Changed.

>
> ======
> src/backend/replication/walsender.c
>
> 7. NeedToWaitForStandby
>
> +/*
> + * Returns true if not all standbys have caught up to the flushed
> +position
> + * (flushed_lsn) when failover_slot is true; otherwise, returns false.
> + *
> + * If returning true, the function sets the appropriate wait event in
> + * wait_event; otherwise, wait_event is set to 0.
> + */
>
> The function comment refers to 'failover_slot' but IMO needs to be worded
> differently because 'failover_slot' is not a parameter anymore.

Changed.

>
> ~~~
>
> 8. NeedToWaitForWal
>
> +/*
> + * Returns true if we need to wait for WALs to be flushed to disk, or
> +if not
> + * all standbys have caught up to the flushed position (flushed_lsn)
> +when
> + * failover_slot is true; otherwise, returns false.
> + *
> + * If returning true, the function sets the appropriate wait event in
> + * wait_event; otherwise, wait_event is set to 0.
> + */
> +static bool
> +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
> + uint32 *wait_event)
>
> Same as above -- Now that 'failover_slot' is not a function parameter, I thought
> this should be reworded.

Changed.

>
> ~~~
>
> 9. NeedToWaitForWal
>
> + /*
> + * Check if the standby slots have caught up to the flushed position.
> + It
> + * is good to wait up to flushed position and then let it send the
> + changes
> + * to logical subscribers one by one which are already covered in
> + flushed
> + * position without needing to wait on every change for standby
> + * confirmation. Note that after receiving the shutdown signal, an
> + ERROR
> + * is reported if any slots are dropped, invalidated, or inactive. This
> + * measure is taken to prevent the walsender from waiting indefinitely.
> + */
> + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) return
> + true;
>
> I felt it was confusing things for this function to also call to the other one -- it
> seems an overlapping/muddling of the purpose of these.
> I think it will be easier to understand if the calling code just calls to one or both
> of these functions as required.

Same as Amit, I didn't change this.

>
> ~~~
>
> 10.
>
> - WalSndWait(wakeEvents, sleeptime,
> WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL);
> + WalSndWait(wakeEvents, sleeptime, wait_event);
>
> Tracing the assignments of the 'wait_event' is a bit tricky... IIUC it should not be
> 0 when we got to this point, so maybe it is better to put Assert(wait_event)
> before this call?

Added.

Best Regards,
Hou zj

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhijie Hou (Fujitsu) 2024-03-03 07:56:32 RE: Synchronizing slots from primary to standby
Previous Message Andy Fan 2024-03-03 07:01:23 a wrong index choose when statistics is out of date