Re: Synchronizing slots from primary to standby

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-01-24 03:21:53
Message-ID: CAHut+Pu_uK==M+VmCMug7m7O6LAwpC05A=T7zP8c4G2-hS+bdg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here are some comments for patch v66-0001.

======
doc/src/sgml/catalogs.sgml

1.
+ <para>
+ If true, the associated replication slots (i.e. the main slot and the
+ table sync slots) in the upstream database are enabled to be
+ synchronized to the physical standbys
+ </para></entry>

/physical standbys/physical standby/

I wondered if it is better just to say singular "standby" instead of
"standbys" in places like this; e.g. plural might imply cascading for
some readers.

There are a number of examples like this, so I've repeated the same
comment multiple times below. If you disagree, please just ignore all
of them.

======
doc/src/sgml/func.sgml

2.
that the decoding of prepared transactions is enabled for this
- slot. A call to this function has the same effect as the replication
- protocol command <literal>CREATE_REPLICATION_SLOT ...
LOGICAL</literal>.
+ slot. The optional fifth parameter,
+ <parameter>failover</parameter>, when set to true,
+ specifies that this slot is enabled to be synced to the
+ physical standbys so that logical replication can be resumed
+ after failover. A call to this function has the same effect as
+ the replication protocol command
+ <literal>CREATE_REPLICATION_SLOT ... LOGICAL</literal>.
</para></entry>

(same as above)

/physical standbys/physical standby/

Also, I don't see anything else on this page using plural "standbys".

======
doc/src/sgml/protocol.sgml

3. CREATE_REPLICATION_SLOT

+ <varlistentry>
+ <term><literal>FAILOVER [ <replaceable
class="parameter">boolean</replaceable> ]</literal></term>
+ <listitem>
+ <para>
+ If true, the slot is enabled to be synced to the physical
+ standbys so that logical replication can be resumed after failover.
+ The default is false.
+ </para>
+ </listitem>
+ </varlistentry>

(same as above)

/physical standbys/physical standby/

~~~

4. ALTER_REPLICATION_SLOT

+ <variablelist>
+ <varlistentry>
+ <term><literal>FAILOVER [ <replaceable
class="parameter">boolean</replaceable> ]</literal></term>
+ <listitem>
+ <para>
+ If true, the slot is enabled to be synced to the physical
+ standbys so that logical replication can be resumed after failover.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>

(same as above)

/physical standbys/physical standby/

======
doc/src/sgml/ref/create_subscription.sgml

5.
+ <varlistentry id="sql-createsubscription-params-with-failover">
+ <term><literal>failover</literal> (<type>boolean</type>)</term>
+ <listitem>
+ <para>
+ Specifies whether the replication slots associated with the
subscription
+ are enabled to be synced to the physical standbys so that logical
+ replication can be resumed from the new primary after failover.
+ The default is <literal>false</literal>.
+ </para>
+ </listitem>
+ </varlistentry>

(same as above)

/physical standbys/physical standby/

======
doc/src/sgml/system-views.sgml

6.
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>failover</structfield> <type>bool</type>
+ </para>
+ <para>
+ True if this is a logical slot enabled to be synced to the physical
+ standbys so that logical replication can be resumed from the new primary
+ after failover. Always false for physical slots.
+ </para></entry>
+ </row>

(same as above)

/physical standbys/physical standby/

======
src/backend/commands/subscriptioncmds.c

7.
+ if (IsSet(opts.specified_opts, SUBOPT_FAILOVER))
+ {
+ if (!sub->slotname)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot set failover for a subscription that does not have a
slot name")));
+
+ /*
+ * Do not allow changing the failover state if the
+ * subscription is enabled. This is because the failover
+ * state of the slot on the publisher cannot be modified if
+ * the slot is currently acquired by the apply worker.
+ */
+ if (sub->enabled)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot set %s for enabled subscription",
+ "failover")));
+
+ values[Anum_pg_subscription_subfailover - 1] =
+ BoolGetDatum(opts.failover);
+ replaces[Anum_pg_subscription_subfailover - 1] = true;
+ }

The first message is not consistent with the second. The "failover"
option maybe should be extracted so it won't be translated.

SUGGESTION
errmsg("cannot set %s for a subscription that does not have a slot
name", "failover")

~~~

8. AlterSubscription

+ if (!wrconn)
+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("could not connect to the publisher: %s", err)));
+

Need to keep an eye on the patch proposed by Nisha [1] for messages
similar to this one, so in case that gets pushed this code should be
changed appropriately.

======
src/backend/replication/slot.c

9.
* during getting changes, if the two_phase option is enabled it can skip
* prepare because by that time start decoding point has been moved. So the
* user will only get commit prepared.
+ * failover: If enabled, allows the slot to be synced to physical standbys so
+ * that logical replication can be resumed after failover.
*/

(same as earlier)

/physical standbys/physical standby/

~~~

10.
+ /*
+ * Do not allow users to alter slots to enable failover on the standby
+ * as we do not support sync to the cascading standby.
+ */
+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot alter replication slot to have failover"
+ " enabled on the standby"));

I felt the errmsg could be expressed with less ambiguity:

SUGGESTION:
cannot enable failover for a replication slot on the standby

======
src/backend/replication/slotfuncs.c

11. create_physical_replication_slot

/* acquire replication slot, this will check for conflicting names */
ReplicationSlotCreate(name, false,
- temporary ? RS_TEMPORARY : RS_PERSISTENT, false);
+ temporary ? RS_TEMPORARY : RS_PERSISTENT, false,
+ false);

Having an inline comment might be helpful here instead of passing "false,false"

SUGGESTION
ReplicationSlotCreate(name, false,
temporary ? RS_TEMPORARY : RS_PERSISTENT, false,
false /* failover */);

~~~

12. create_logical_replication_slot

+ /*
+ * Do not allow users to create the slots with failover enabled on the
+ * standby as we do not support sync to the cascading standby.
+ */
+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create replication slot with failover"
+ " enabled on the standby"));

(similar to previous comment)

SUGGESTION:
cannot enable failover for a replication slot created on the standby

~~~

13. copy_replication_slot

* hence pass find_startpoint false. confirmed_flush will be set
* below, by copying from the source slot.
+ *
+ * To avoid potential issues with the slotsync worker when the
+ * restart_lsn of a replication slot goes backwards, we set the
+ * failover option to false here. This situation occurs when a slot on
+ * the primary server is dropped and immediately replaced with a new
+ * slot of the same name, created by copying from another existing
+ * slot. However, the slotsync worker will only observe the restart_lsn
+ * of the same slot going backwards.
*/
create_logical_replication_slot(NameStr(*dst_name),
plugin,
temporary,
false,
+ false,
src_restart_lsn,
false);
(similar to an earlier comment)

Having an inline comment might be helpful here.

e.g. false /* failover */,

======
src/backend/replication/walreceiver.c

14.
- walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
+ walrcv_create_slot(wrconn, slotname, true, false, false, 0, NULL);

(similar to an earlier comment)

Having an inline comment might be helpful here:

SUGGESTION
walrcv_create_slot(wrconn, slotname, true, false, false /* failover
*/, 0, NULL);

======
src/backend/replication/walsender.c

15. CreateReplicationSlot

ReplicationSlotCreate(cmd->slotname, false,
cmd->temporary ? RS_TEMPORARY : RS_PERSISTENT,
- false);
+ false, false);

(similar to an earlier comment)

Having an inline comment might be helpful here.

e.g. false /* failover */,

~~~

16. CreateReplicationSlot

+ /*
+ * Do not allow users to create the slots with failover enabled on the
+ * standby as we do not support sync to the cascading standby.
+ */
+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create replication slot with failover"
+ " enabled on the standby"));
+
/*
* Initially create persistent slot as ephemeral - that allows us to
* nicely handle errors during initialization because it'll get
@@ -1243,7 +1265,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
*/
ReplicationSlotCreate(cmd->slotname, true,
cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
- two_phase);
+ two_phase, failover);

This errmsg seems to be repeated in a few places, so I wondered if
this code can be refactored to call direct to
create_logical_replication_slot() so the errmsg can be just once in a
common place.

OTOH, if it cannot be refactored, then needs to be using same errmsg
as suggested by earlier review comments (see above).

======
src/include/catalog/pg_subscription.h

17.
+ bool subfailover; /* True if the associated replication slots
+ * (i.e. the main slot and the table sync
+ * slots) in the upstream database are enabled
+ * to be synchronized to the physical
+ * standbys. */

(same as earlier)

/physical standbys/physical standby/

~~~

18.
+ bool failover; /* True if the associated replication slots
+ * (i.e. the main slot and the table sync
+ * slots) in the upstream database are enabled
+ * to be synchronized to the physical
+ * standbys. */

(same as earlier)

/physical standbys/physical standby/

======
src/include/replication/slot.h

19.
+
+ /*
+ * Is this a failover slot (sync candidate for physical standbys)? Only
+ * relevant for logical slots on the primary server.
+ */
+ bool failover;

(same as earlier)

/physical standbys/physical standby/

======
[1] Nisha errmsg -
https://www.postgresql.org/message-id/CABdArM5-VR4Akt_AHap_0Ofne0cTcsdnN6FcNe%2BMU8eXsa_ERQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-01-24 04:09:01 Re: Synchronizing slots from primary to standby
Previous Message Bharath Rupireddy 2024-01-24 02:21:38 Re: Add code indentation check to cirrus-ci (was Re: Add BF member koel-like indentation checks to SanityCheck CI)