Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-10-12 03:53:14
Message-ID: CAJpy0uCRykgTBwu9tTTdxxcM-r9FAerAO4yJgWZc7MFb-06gSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 10, 2023 at 12:52 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> On Mon, Oct 9, 2023 at 9:34 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Wed, Oct 4, 2023 at 8:53 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> > >
> > > Here are some review comments for v20-0002.
> > >
> >
> > Thanks Peter for the feedback. Comments from 31 till end are addressed
> > in v22. First 30 comments will be addressed in the next version.
> >
>
> Thanks for addressing my previous comments.
>
> I checked those and went through other changes in v22-0002 to give a
> few more review comments below.
>

Thank You for your feedback. I have addressed these in v23.

> I understand there are some design changes coming soon regarding the
> use of GUCs so maybe a few of these comments will become redundant.
>
> ======
> doc/src/sgml/config.sgml
>
> 1.
> A password needs to be provided too, if the sender demands password
> authentication. It can be provided in the
> <varname>primary_conninfo</varname> string, or in a separate
> - <filename>~/.pgpass</filename> file on the standby server (use
> - <literal>replication</literal> as the database name).
> - Do not specify a database name in the
> - <varname>primary_conninfo</varname> string.
> + <filename>~/.pgpass</filename> file on the standby server.
> + </para>
> + <para>
> + Specify a database name in <varname>primary_conninfo</varname> string
> + to allow synchronization of slots from the primary to standby. This
> + dbname will only be used for slots synchronization purpose and will
> + be irrelevant for streaming.
> </para>
>
> 1a.
> "Specify a database name in...". Shouldn't that say "Specify dbname in..."?
>
> ~
>
> 1b.
> BEFORE
> This dbname will only be used for slots synchronization purpose and
> will be irrelevant for streaming.
>
> SUGGESTION
> This will only be used for slot synchronization. It is ignored for streaming.
>
> ======
> doc/src/sgml/system-views.sgml
>
> 2. pg_replication_slots
>
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>synced_slot</structfield> <type>bool</type>
> + </para>
> + <para>
> + True if this logical slot is created on physical standby as part of
> + slot-synchronization from primary server. Always false for
> physical slots.
> + </para></entry>
> + </row>
>
> /on physical standby/on the physical standby/
>
> /from primary server/from the primary server/
>
> ======
> src/backend/replication/logical/launcher.c
>
> 3. LaunchSlotSyncWorkers
>
> + /*
> + * If we failed to launch this slotsync worker, return and try
> + * launching rest of the workers in next sync cycle. But change
> + * launcher's wait time to minimum of wal_retrieve_retry_interval and
> + * default wait time to try next sync-cycle sooner.
> + */
>
> 3a.
> Use consistent terms -- choose "sync cycle" or "sync-cycle"
>
> ~
>
> 3b.
> Is it correct to just say "rest of the workers"; won't it also try to
> relaunch this same failed worker again?
>
> ~~~
>
> 4. LauncherMain
>
> + /*
> + * Stop the slot-sync workers if any of the related GUCs changed.
> + * These will be relaunched using the new values during next
> + * sync-cycle. Also revalidate the new configurations and
> + * reconnect.
> + */
> + if (SlotSyncConfigsChanged())
> + {
> + slotsync_workers_stop();
> +
> + if (wrconn)
> + walrcv_disconnect(wrconn);
> +
> + if (RecoveryInProgress())
> + wrconn = slotsync_remote_connect();
> + }
>
> Was it overkill to disconnect/reconnect every time any of those GUCs
> changed? Or is it enough to do that only if the
> PrimaryConnInfoPreReload was changed?
>
> ======
> src/backend/replication/logical/logical.c
>
> 5. CreateDecodingContext
>
> + /*
> + * Do not allow consumption of a "synchronized" slot until the standby
> + * gets promoted.
> + */
> + if (RecoveryInProgress() && slot->data.synced)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot use replication slot \"%s\" for logical decoding",
> + NameStr(slot->data.name)),
> + errdetail("This slot is being synced from primary."),
> + errhint("Specify another replication slot.")));
> +
>
> /from primary/from the primary/
>
> ======
> src/backend/replication/logical/slotsync.c
>
> 6. use_slot_in_query
>
> + /*
> + * Return TRUE if either slot is not yet created on standby or if it
> + * belongs to one of the dbs passed in dbids.
> + */
> + if (!slot_found || relevant_db)
> + return true;
> +
> + return false;
>
> Same as single line:
>
> return (!slot_found || relevant_db);
>
> ~~~
>
> 7. synchronize_one_slot
>
> + /*
> + * If the local restart_lsn and/or local catalog_xmin is ahead of
> + * those on the remote then we cannot create the local slot in sync
> + * with primary because that would mean moving local slot backwards
> + * and we might not have WALs retained for old lsns. In this case we
> + * will wait for primary's restart_lsn and catalog_xmin to catch up
> + * with the local one before attempting the sync.
> + */
>
> /moving local slot/moving the local slot/
>
> /with primary/with the primary/
>
> /wait for primary's/wait for the primary's/
>
> ~~~
>
> 8. ProcessSlotSyncInterrupts
>
> + if (ConfigReloadPending)
> + {
> + ConfigReloadPending = false;
> +
> + /* Save the PrimaryConnInfo before reloading */
> + *conninfo_prev = pstrdup(PrimaryConnInfo);
>
> If the configuration keeps changing then there might be a slow leak
> here because I didn't notice anywhere where this strdup'ed string is
> getting freed. Is that something worth worrying about?
>
> ======
> src/backend/replication/slot.c
>
> 9. ReplicationSlotDrop
>
> + /*
> + * Do not allow users to drop the slots which are currently being synced
> + * from the primary to standby.
> + */
> + if (user_cmd && RecoveryInProgress() && MyReplicationSlot->data.synced)
> + {
> + ReplicationSlotRelease();
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot drop replication slot"),
> + errdetail("This slot is being synced from primary.")));
> + }
> +
>
> 9a.
> /to standby/to the standby/
>
> ~
>
> 9b.
> Shouldn't the errmsg name the slot? Otherwise, the message might not
> be so useful.
>
> ~
>
> 9c.
> /synced from primary/synced from the primary/
>
> ======
> src/backend/replication/walsender.c
>
>
> 10. ListSlotDatabaseOIDs
>
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> + for (slotno = 0; slotno < max_replication_slots; slotno++)
> + {
> + ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[slotno];
>
> This is all new code so you can use C99 for loop variable declaration here.
>
> ~~~
>
> 11.
> + /* If synchronize_slot_names is '*', then skip physical slots */
> + if (SlotIsPhysical(slot))
> + continue;
> +
>
>
> Some mental gymnastics are needed to understand how this code means "
> synchronize_slot_names is '*'".
>
> IMO it would be easier to understand if the previous "if
> (numslot_names)" was rewritten as if/else.
>
> ======
> .../utils/activity/wait_event_names.txt
>
> 12.
> RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL
> to arrive, during streaming recovery."
> +REPL_SLOTSYNC_MAIN "Waiting in main loop of worker for synchronizing
> slots to a standby from primary."
> +REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for primary to catch-up in
> worker for synchronizing slots to a standby from primary."
> SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
>
> 12a.
> Maybe those descriptions can be simplified a bit?
>
> SUGGESTION
> REPL_SLOTSYNC_MAIN "Waiting in the main loop of slot-sync worker."
> REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for the primary to catch up, in
> slot-sync worker."
>
> ~
>
> 12b.
> typo?
>
> /REPL_SLOTSYNC_PRIMARY_CATCHP/REPL_SLOTSYNC_PRIMARY_CATCHUP/
>
> ======
> src/include/replication/walreceiver.h
>
> 13. WalRcvRepSlotDbData
>
> +/*
> + * Slot's DBid related data
> + */
> +typedef struct WalRcvRepSlotDbData
> +{
> + Oid database; /* Slot's DBid received from remote */
> +} WalRcvRepSlotDbData;
>
> Just calling this new field 'database' seems odd. Searching PG src I
> found typical fields/variables like this one are called 'databaseid',
> or 'dboid', or 'dbid', or 'db_id' etc.
>
> ======
> Kind Regards,
> Peter Smith.
> Fujitsu Australia

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2023-10-12 03:53:53 Re: [PATCH] Extend ALTER OPERATOR to support adding commutator, negator, hashes, and merges
Previous Message shveta malik 2023-10-12 03:48:51 Re: Synchronizing slots from primary to standby