RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-02-26 02:18:58
Message-ID: OS3PR01MB57184FA4AB9F2E6F1BF802DC945A2@OS3PR01MB5718.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Friday, February 23, 2024 6:12 PM Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Here are some random comments:

Thanks for the comments!

>
> 1 ===
>
> Commit message "Allow logical walsenders to wait for the physical"
>
> s/physical/physical standby/?
>
> 2 ==
>
> +++ b/src/backend/replication/logical/logicalfuncs.c
> @@ -30,6 +30,7 @@
> #include "replication/decode.h"
> #include "replication/logical.h"
> #include "replication/message.h"
> +#include "replication/walsender.h"
>
> Is this include needed?

Removed.

>
> 3 ===
>
> + * Slot sync is currently not supported on the cascading
> + standby. This is
>
> s/on the/on a/?

Changed.

>
> 4 ===
>
> + if (!ok)
> + GUC_check_errdetail("List syntax is invalid.");
> +
> + /*
> + * If there is a syntax error in the name or if the replication slots'
> + * data is not initialized yet (i.e., we are in the startup process), skip
> + * the slot verification.
> + */
> + if (!ok || !ReplicationSlotCtl)
> + {
> + pfree(rawname);
> + list_free(elemlist);
> + return ok;
> + }
>
> we are testing the "ok" value twice, what about using if...else if... instead and
> test it once? If so, it might be worth to put the:
>
> "
> + pfree(rawname);
> + list_free(elemlist);
> + return ok;
> "
>
> in a "goto".

There were comments to remove the 'goto' statement and avoid
duplicate free code, so I prefer the current style.

>
> 5 ===
>
> + * for which all standbys to wait for. Even if we have
> + physical-slots
>
> s/physical-slots/physical slots/?

Changed.

>
> 6 ===
>
> * Switch to the same memory context under which GUC variables are
>
> s/to the same memory/to the memory/?

Changed.

>
> 7 ===
>
> + * Return a copy of standby_slot_names_list if the copy flag is set to
> + true,
>
> Not sure, but would it be worth explaining why one would want to set to flag to
> true or false? (i.e why one would not want to receive the original list).

I think the usage can be found from the caller's code, e.g we need to remove
the slots that caught up from the list each time, so we cannot directly modify
the global list. The GetStandbySlotList function is general function and I feel
we can avoid adding more comments here.

>
> 8 ===
>
> + if (RecoveryInProgress())
> + return NIL;
>
> The need is well documented just above, but are we not violating the fact that
> we return the original list or a copy of it? (that's what the comment above the
> GetStandbySlotList() function definition is saying).
>
> I think the comment above the GetStandbySlotList() function needs a bit of
> rewording to cover that case.

Adjusted.

>
> 9 ===
>
> + * harmless, a WARNING should be enough, no need to
> error-out.
>
> s/error-out/error out/?

Changed.

>
> 10 ===
>
> + if (slot->data.invalidated != RS_INVAL_NONE)
> + {
> + /*
> + * Specified physical slot have been invalidated,
> so no point
> + * in waiting for it.
>
> We discovered in [1], that if the wal_status is "unreserved" then the slot is still
> serving the standby. I think we should handle this case differently, thoughts?

I think the 'invalidated' slot can still be used is a separate bug. Because
once the slot is invalidated, it can neither protect WALs or ROWs from being
removed even if the restart_lsn of the slot can be moved forward after being invalidated.

If the standby can move restart_lsn forward for invalidated slots, then
it should also set the 'invalidated' flag back to NONE, otherwise the slot
cannot serve its purpose anymore. I also reported similar bug before[1].

>
> 11 ===
>
> + * Specified physical slot have been
> + invalidated, so no point
>
> s/have been/has been/?

Changed.

>
> 12 ===
>
> +++ b/src/backend/replication/slotfuncs.c
> @@ -22,6 +22,7 @@
> #include "replication/logical.h"
> #include "replication/slot.h"
> #include "replication/slotsync.h"
> +#include "replication/walsender.h"
>
> Is this include needed?

No, it's not needed. Removed.

Attach the V98 patch set which addressed above comments.
I also adjusted few comments based on off-list comments from Shveta.

The discussion for wait behavior is on-going, so I didn't change the behavior in this version.

[1] https://www.postgresql.org/message-id/flat/OS0PR01MB5716A626A4AF5814E057CEE39484A(at)OS0PR01MB5716(dot)jpnprd01(dot)prod(dot)outlook(dot)com

Best Regards,
Hou zj

Attachment Content-Type Size
v98-0002-Document-the-steps-to-check-if-the-standby-is-re.patch application/octet-stream 7.0 KB
v98-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch application/octet-stream 37.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrei Lepikhov 2024-02-26 02:52:59 Re: "type with xxxx does not exist" when doing ExecMemoize()
Previous Message Andy Fan 2024-02-26 02:04:14 Re: Shared detoast Datum proposal