Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-11-06 04:09:58
Message-ID: CAJpy0uDpV0suPbhCp+1aRLXEChD9uKp-ffBW_HfZro=53JKK5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 6, 2023 at 7:01 AM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Friday, November 3, 2023 7:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> >
> > On Thu, Nov 2, 2023 at 2:35 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> > wrote:
> > >
> > > Here is the new version patch set(V29) which addressed Peter
> > > comments[1][2] and fixed one doc compile error.
> > >
> >
> > Few comments:
> > ==============
> > 1.
> > + <varlistentry id="sql-createsubscription-params-with-failover">
> > + <term><literal>failover</literal> (<type>boolean</type>)</term>
> > + <listitem>
> > + <para>
> > + Specifies whether the replication slot assocaited with the
> > subscription
> > + is enabled to be synced to the physical standbys so that logical
> > + replication can be resumed from the new primary after failover.
> > + The default is <literal>true</literal>.
> >
> > Why do you think it is a good idea to keep the default value as true?
> > I think the user needs to enable standby for syncing slots which is not a default
> > feature, so by default, the failover property should also be false. AFAICS, it is
> > false for create_slot SQL API as per the below change; so that way also keeping
> > default true for a subscription doesn't make sense.
> > @@ -479,6 +479,7 @@ CREATE OR REPLACE FUNCTION
> > pg_create_logical_replication_slot(
> > IN slot_name name, IN plugin name,
> > IN temporary boolean DEFAULT false,
> > IN twophase boolean DEFAULT false,
> > + IN failover boolean DEFAULT false,
> > OUT slot_name name, OUT lsn pg_lsn)
> >
> > BTW, the below change indicates that the code treats default as false; so, it
> > seems to be a documentation error.
>
> I think the document is wrong and fixed it.
>
> >
> > 2.
> > -
> > /*
> > * Common option parsing function for CREATE and ALTER SUBSCRIPTION
> > commands.
> > *
> >
> > Spurious line removal.
> >
> > 3.
> > + else if (opts.slot_name && failover_enabled) {
> > + walrcv_alter_slot(wrconn, opts.slot_name, opts.failover);
> > + ereport(NOTICE, (errmsg("altered replication slot \"%s\" on
> > + publisher", opts.slot_name))); }
> >
> > I think we can add a comment to describe why it makes sense to enable the
> > failover property of the slot in this case. Can we change the notice message to:
> > "enabled failover for replication slot \"%s\" on publisher"
>
> Added.
>
> >
> > 4.
> > libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
> > - bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
> > - XLogRecPtr *lsn)
> > + bool temporary, bool two_phase, bool failover, CRSSnapshotAction
> > + snapshot_action, XLogRecPtr *lsn)
> > {
> > PGresult *res;
> > StringInfoData cmd;
> > @@ -913,7 +917,14 @@ libpqrcv_create_slot(WalReceiverConn *conn, const
> > char *slotname,
> > else
> > appendStringInfoChar(&cmd, ' ');
> > }
> > -
> > + if (failover)
> > + {
> > + appendStringInfoString(&cmd, "FAILOVER"); if (use_new_options_syntax)
> > + appendStringInfoString(&cmd, ", "); else appendStringInfoChar(&cmd, '
> > + '); }
> >
> > I don't see a corresponding change in repl_gram.y. I think the following part of
> > the code needs to be changed:
> > /* CREATE_REPLICATION_SLOT slot [TEMPORARY] LOGICAL plugin [options] */
> > | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT
> > create_slot_options
> >
>
> I think after 0266e98, we started to use the new syntax(see the
> generic_option_list rule) and we can avoid changing the repl_gram.y when adding
> new options. The new failover can be detected when parsing the generic option
> list(in parseCreateReplSlotOptions).
>
>
> >
> > 5.
> > @@ -228,6 +230,28 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo
> > fcinfo, bool confirm, bool bin
> > NameStr(MyReplicationSlot->data.plugin),
> > format_procedure(fcinfo->flinfo->fn_oid))));
> > ..
> > + if (XLogRecPtrIsInvalid(upto_lsn))
> > + wal_to_wait = end_of_wal;
> > + else
> > + wal_to_wait = Min(upto_lsn, end_of_wal);
> > +
> > + /* Initialize standby_slot_names_list */ SlotSyncInitConfig();
> > +
> > + /*
> > + * Wait for specified streaming replication standby servers (if any)
> > + * to confirm receipt of WAL upto wal_to_wait.
> > + */
> > + WalSndWaitForStandbyConfirmation(wal_to_wait);
> > +
> > + /*
> > + * The memory context used to allocate standby_slot_names_list will be
> > + * freed at the end of this call. So free and nullify the list in
> > + * order to avoid usage of freed list in the next call to this
> > + * function.
> > + */
> > + SlotSyncFreeConfig();
> >
> > What if there is an error in WalSndWaitForStandbyConfirmation() before calling
> > SlotSyncFreeConfig()? I think the problem you are trying to avoid by freeing it
> > here can occur. I think it is better to do this in a logical decoding context and
> > free the list along with it as we are doing in commit c7256e6564(see PG15).
>
> I will analyze more about this case and update in next version.
>
> > Also,
> > it is better to allocate this list somewhere in
> > WalSndWaitForStandbyConfirmation(), probably in WalSndGetStandbySlots,
> > that will make the code look neat and also avoid allocating this list when
> > failover is not enabled for the slot.
>
> Changed as suggested.
>
>
> >
> > 6.
> > +/* ALTER_REPLICATION_SLOT slot */
> > +alter_replication_slot:
> > + K_ALTER_REPLICATION_SLOT IDENT '(' generic_option_list ')'
> >
> > I think you need to update the docs for this new command. See existing docs
> > [1].
> >
> > [1] - https://www.postgresql.org/docs/devel/protocol-replication.html
>
> I think the doc for alter_replication_slot was added in V29.
>
> Attach the V30 patch set which addressed above comments and fixed CFbot failures.
>

Thanks Hou-San for the patches.

+ /* The primary_slot_name is not set */
+ if (!WalRcv || WalRcv->slotname[0] == '\0')
+ {
+ ereport(WARNING,
+ errmsg("skipping slots synchronization as primary_slot_name "
+ "is not set."));
+
+ /*
+ * It's possible that the Walreceiver has not been started yet, adjust
+ * the wait_time to retry sooner in the next synchronization cycle.
+ */
+ *wait_time = wal_retrieve_retry_interval;
+ return NULL;
+ }

+ if (!RecoveryInProgress())
+ LaunchSubscriptionApplyWorker(&wait_time);
+ else if (wrconn == NULL)
+ wrconn = slotsync_remote_connect(&wait_time);

If primary_slot_name is genuinely missing, then the launcher will keep
on attempting to reconnect and will keep on logging warnings which is
not good.

2023-11-06 09:31:32.206 IST [1032781] WARNING: skipping slots
synchronization as primary_slot_name is not set.
2023-11-06 09:31:37.212 IST [1032781] WARNING: skipping slots
synchronization as primary_slot_name is not set.
2023-11-06 09:31:42.219 IST [1032781] WARNING: skipping slots
synchronization as primary_slot_name is not set.

Same is true for other parameters checked by slotsync_remote_connect,
only the frequency of WARNING msgs will be lesser (after every 3
mins).

Perhaps we should try connecting only once during the start of the
launcher and then after each configReload? In order to take care of
cfbot failure, where the launcher may start before WalReceiver and
thus may not find WalRcv->slotname[0] set in the launcher, we may go
by checking GUC primary_slot_name directly in the launcher? Thoughts?

thanks
Shveta

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2023-11-06 05:05:59 Re: Add recovery to pg_control and remove backup_label
Previous Message Dilip Kumar 2023-11-06 04:09:54 Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock