RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-01-10 12:23:14
Message-ID: OS0PR01MB57163DE4DDCDFA9B75DCCBC794692@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wednesday, January 10, 2024 2:26 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Jan 9, 2024 at 5:44 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> wrote:
> >
> comments on 0002

Thanks for the comments !

>
> 1.
> +/* Worker's nap time in case of regular activity on the primary server */
> +#define WORKER_DEFAULT_NAPTIME_MS 10L /* 10 ms */
> +
> +/* Worker's nap time in case of no-activity on the primary server */
> +#define WORKER_INACTIVITY_NAPTIME_MS 10000L /* 10 sec
> */
>
> Instead of directly switching between 10ms to 10s shouldn't we increase the
> nap time gradually? I mean it can go beyond 10 sec as well but instead of
> directly switching from 10ms to 10 sec we can increase it every time with some
> multiplier and keep a max limit up to which it can grow. Although we can reset
> back to 10ms directly as soon as we observe some activity.

Agreed. I changed the strategy similar to what we do in the walsummarizer.

>
> 2.
> SlotSyncWorkerCtxStruct add this to typedefs. list file
>
> 3.
> +/*
> + * Update local slot metadata as per remote_slot's positions */ static
> +void local_slot_update(RemoteSlot *remote_slot) {
> +Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE);
> +
> + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn);
> + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn,
> + remote_slot->catalog_xmin);
> + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn,
> + remote_slot->restart_lsn);
> +}
>
> IIUC on the standby we just want to overwrite what we get from primary no? If
> so why we are using those APIs that are meant for the actual decoding slots
> where it needs to take certain logical decisions instead of mere overwriting?

I think we don't have a strong reason to use these APIs, but it was convenient to
use these APIs as they can take care of updating the slots info and will call
functions like, ReplicationSlotsComputeRequiredXmin,
ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly overwriting
the fields and call these manually ?

>
> 4.
> +/*
> + * Helper function for drop_obsolete_slots()
> + *
> + * Drops synced slot identified by the passed in name.
> + */
> +static void
> +drop_synced_slots_internal(const char *name, bool nowait)
>
> Suggestion to add one line to explain no wait in the header

The 'nowait' flag is not necessary now, so removed.

>
> 5.
> +/*
> + * Helper function to check if local_slot is present in remote_slots list.
> + *
> + * It also checks if logical slot is locally invalidated i.e.
> +invalidated on
> + * the standby but valid on the primary server. If found so, it sets
> + * locally_invalidated to true.
> + */
>
> Instead of saying "but valid on the primary server" better to mention it in the
> remote_slots list, because here this function is just checking the remote_slots
> list regardless of whether the list came from. Mentioning primary seems like it
> might fetch directly from the primary in this function so this is a bit confusing.

Adjusted.

>
> 6.
> +/*
> + * Check that all necessary GUCs for slot synchronization are set
> + * appropriately. If not, raise an ERROR.
> + */
> +static void
> +validate_slotsync_parameters(char **dbname)
>
>
> The function name just says 'validate_slotsync_parameters' but it also gets the
> dbname so I think it better we change the name accordingly also instead of
> passing dbname as a parameter just return it directly.
> There
> is no need to pass this extra parameter and make the function return void.

Renamed.

>
> 7.
> + tupslot = MakeSingleTupleTableSlot(res->tupledesc,
> + &TTSOpsMinimalTuple); tuple_ok =
> + tuplestore_gettupleslot(res->tuplestore, true, false, tupslot);
> + Assert(tuple_ok); /* It must return one tuple */
>
> Comments say 'It must return one tuple' but asserting just for at least one tuple
> shouldn't we enhance assert so that it checks that we got exactly one tuple?

Changed to use tuplestore_tuple_count.

>
> 8.
> /* No need to check further, return that we are cascading standby */
> + *am_cascading_standby = true;
>
> we are not returning immediately we are just setting am_cascading_standby to
> true so adjust comments accordingly

Adjusted.

>
> 9.
> + /* No need to check further, return that we are cascading standby */
> + *am_cascading_standby = true; } else {
> + /* We are a normal standby. */
>
> Single-line comments do not follow the uniform pattern for the full stop, either
> use a full stop for all single-line comments or none, at least follow the same rule
> in a file or nearby comments.

Adjusted.

>
> 10.
> + errmsg("exiting from slot synchronization due to bad configuration"),
> + errhint("%s must be defined.", "primary_slot_name"));
>
> Why we are using the constant string "primary_slot_name" as a variable in this
> error formatting?

It was suggested to make it friendly to the translator, as the GUC doesn't needs to be translated
and it can avoid adding multiple similar message to be translated.

>
> 11.
> + /*
> + * Hot_standby_feedback must be enabled to cooperate with the physical
> + * replication slot, which allows informing the primary about the xmin
> + and
> + * catalog_xmin values on the standby.
>
> I do not like capitalizing the first letter of the 'hot_standby_feedback' which is a
> GUC parameter

Changed.

Here is the V59 patch set which addressed above comments and comments from Peter[1].

[1] https://www.postgresql.org/message-id/CAHut%2BPu34_dYj9MnV6n3cPsssEx57YaO6Pg0d9mDryQZX2Mx3g%40mail.gmail.com

Best Regards,
Hou zj

Attachment Content-Type Size
v59-0001-Enable-setting-failover-property-for-a-slot-thro.patch application/octet-stream 100.2 KB
v59-0002-Add-logical-slot-sync-capability-to-the-physical.patch application/octet-stream 82.5 KB
v59-0003-Allow-logical-walsenders-to-wait-for-the-physica.patch application/octet-stream 39.7 KB
v59-0004-Non-replication-connection-and-app_name-change.patch application/octet-stream 9.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-01-10 12:32:12 Re: A recent message added to pg_upgade
Previous Message Aleksander Alekseev 2024-01-10 12:15:57 Re: Compile warnings in dbcommands.c building with meson