Re: Synchronizing slots from primary to standby

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-12-13 10:22:54
Message-ID: CAHut+PtOc7J_n24HJ6f_dFWTuD3X2ApOByQzZf6jZz+0wb-ebQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Shveta, here are some review comments for v45-0002.

======
doc/src/sgml/bgworker.sgml

1.
+ <variablelist>
+ <varlistentry>
+ <term><literal>BgWorkerStart_PostmasterStart</literal></term>
+ <listitem>
+ <para>
+ <indexterm><primary>BgWorkerStart_PostmasterStart</primary></indexterm>
+ Start as soon as postgres itself has finished its own initialization;
+ processes requesting this are not eligible for database connections.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>BgWorkerStart_ConsistentState</literal></term>
+ <listitem>
+ <para>
+ <indexterm><primary>BgWorkerStart_ConsistentState</primary></indexterm>
+ Start as soon as a consistent state has been reached in a hot-standby,
+ allowing processes to connect to databases and run read-only queries.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>BgWorkerStart_RecoveryFinished</literal></term>
+ <listitem>
+ <para>
+ <indexterm><primary>BgWorkerStart_RecoveryFinished</primary></indexterm>
+ Start as soon as the system has entered normal read-write state. Note
+ that the <literal>BgWorkerStart_ConsistentState</literal> and
+ <literal>BgWorkerStart_RecoveryFinished</literal> are equivalent
+ in a server that's not a hot standby.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>BgWorkerStart_ConsistentState_HotStandby</literal></term>
+ <listitem>
+ <para>
+ <indexterm><primary>BgWorkerStart_ConsistentState_HotStandby</primary></indexterm>
+ Same meaning as <literal>BgWorkerStart_ConsistentState</literal> but
+ it is more strict in terms of the server i.e. start the worker only
+ if it is hot-standby.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>

Maybe reorder these slightly, because I felt it is better if the
BgWorkerStart_ConsistentState_HotStandby comes next after
BgWorkerStart_ConsistentState, which it refers to

For example::
1st.BgWorkerStart_PostmasterStart
2nd.BgWorkerStart_ConsistentState
3rd.BgWorkerStart_ConsistentState_HotStandby
4th.BgWorkerStart_RecoveryFinished

======
doc/src/sgml/config.sgml

2.
<varname>enable_syncslot</varname> = true

Not sure, but I thought the "= true" part should be formatted too.

SUGGESTION
<literal>enable_syncslot = true</literal>

======
doc/src/sgml/logicaldecoding.sgml

3.
+ <para>
+ A logical replication slot on the primary can be synchronized to the hot
+ standby by enabling the failover option during slot creation and setting
+ <varname>enable_syncslot</varname> on the standby. For the synchronization
+ to work, it is mandatory to have a physical replication slot between the
+ primary and the standby. It's highly recommended that the said physical
+ replication slot is listed in <varname>standby_slot_names</varname> on
+ the primary to prevent the subscriber from consuming changes faster than
+ the hot standby. Additionally, <varname>hot_standby_feedback</varname>
+ must be enabled on the standby for the slots synchronization to work.
+ </para>

I felt those parts that describe the mandatory GUCs should be kept together.

SUGGESTION
For the synchronization to work, it is mandatory to have a physical
replication slot between the primary and the standby, and
<varname>hot_standby_feedback</varname> must be enabled on the
standby.

It's also highly recommended that the said physical replication slot
is named in <varname>standby_slot_names</varname> list on the primary,
to prevent the subscriber from consuming changes faster than the hot
standby.

~~~

4. (Chapter 49)

By enabling synchronization of slots, logical replication can be
resumed after failover depending upon the
pg_replication_slots.sync_state for the synchronized slots on the
standby at the time of failover. Only slots that were in ready
sync_state ('r') on the standby before failover can be used for
logical replication after failover. However, the slots which were in
initiated sync_state ('i') and not sync-ready ('r') at the time of
failover will be dropped and logical replication for such slots can
not be resumed after failover. This applies to the case where a
logical subscription is disabled before failover and is enabled after
failover. If the synchronized slot due to disabled subscription could
not be made sync-ready ('r') on standby, then the subscription can not
be resumed after failover even when enabled. If the primary is idle,
then the synchronized slots on the standby may take a noticeable time
to reach the ready ('r') sync_state. This can be sped up by calling
the pg_log_standby_snapshot function on the primary.

~

Somehow, I still felt all that was too wordy/repetitive. Below is my
attempt to make it more concise. Thoughts?

SUGGESTION
The ability to resume logical replication after failover depends upon
the pg_replication_slots.sync_state value for the synchronized slots
on the standby at the time of failover. Only slots that have attained
a "ready" sync_state ('r') on the standby before failover can be used
for logical replication after failover. Slots that have not yet
reached 'r' state (they are still 'i') will be dropped, therefore
logical replication for those slots cannot be resumed. For example, if
the synchronized slot could not become sync-ready on standby due to a
disabled subscription, then the subscription cannot be resumed after
failover even when it is enabled.

If the primary is idle, the synchronized slots on the standby may take
a noticeable time to reach the ready ('r') sync_state. This can be
sped up by calling the pg_log_standby_snapshot function on the
primary.

======
doc/src/sgml/system-views.sgml

5.
+ <para>
+ Defines slot synchronization state. This is meaningful on the physical
+ standby which has configured <varname>enable_syncslot</varname> = true
+ </para>

As mentioned in the previous review comment ([1]#10) I thought it
might be good to include a hyperlink cross-reference to the
'enable_syncslot' GUC.

~~~

6.
+ <para>
+ The hot standby can have any of these sync_state for the slots but on a
+ hot standby, the slots with state 'r' and 'i' can neither be used for
+ logical decoding nor dropped by the user. The primary server will have
+ sync_state as 'n' for all the slots. But if the standby is promoted to
+ become the new primary server, sync_state can be seen 'r' as well. On
+ this new primary server, slots with sync_state as 'r' and 'n' will
+ behave the same.
+ </para></entry>

6a.
/these sync_state for the slots/these sync_state values for the slots/

~

6b
Hm. I still felt (same as previous review [1]#12b) that there seems
too much information here.

IIUC the sync_state is only meaningful on the standby. Sure, it might
have some values line 'n' or 'r' on the primary also, but those either
mean nothing ('n') or are leftover states from a previous failover
from a standby ('r'), which also means nothing. So can't we just say
it more succinctly like that?

SUGGESTION
The sync_state has no meaning on the primary server; the primary
sync_state value is default 'n' for all slots but may (if leftover
from a promoted standby) also be 'r'.

======
.../libpqwalreceiver/libpqwalreceiver.c

7.
static void
libpqrcv_alter_slot(WalReceiverConn *conn, const char *slotname,
- bool failover)
+ bool failover)

Still seems to be tampering with indentation that should only be in patch 0001.

======
src/backend/replication/logical/slotsync.c

8. wait_for_primary_slot_catchup

The meaning of the boolean return of this function is still not
described by the function comment.

~~~

9.
+ * If passed, *wait_attempts_exceeded will be set to true only if this
+ * function exits after exhausting its wait attempts. It will be false
+ * in all the other cases like failure, remote-slot invalidation, primary
+ * could catch up.

The above already says when a return false happens, so it seems
overkill to give more information.

SUGGESTION
If passed, *wait_attempts_exceeded will be set to true only if this
function exits due to exhausting its wait attempts. It will be false
in all the other cases.

~~~

10.

+static bool
+wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
+ bool *wait_attempts_exceeded)
+{
+#define WAIT_OUTPUT_COLUMN_COUNT 4
+#define WORKER_PRIMARY_CATCHUP_WAIT_ATTEMPTS 5
+

10a
Maybe the long constant name is too long. How about
WAIT_PRIMARY_CATCHUP_ATTEMPTS?

~~~

10b.
IMO it is better to Assert the input value of this kind of side-effect
return parameter, to give a better understanding and to prevent future
accidents.

SUGGESTION
Assert(wait_attempts_exceeded == NULL |} *wait_attempts_exceeded == false);

~~~

11. synchronize_one_slot

+ ReplicationSlot *s;
+ ReplicationSlot *slot;
+ char sync_state = '\0';

11a.
I don't think you need both 's' and 'slot' ReplicationSlot -- it looks
a bit odd. Can't you just reuse the one 'slot' variable?

~

11b.
Also, maybe those assignment like
+ slot = MyReplicationSlot;

can have an explanatory comment like:
/* For convenience, we assign MyReplicationSlot to a shorter variable name. */

~~~

12.
+static void
+synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
+ bool *slot_updated)
+{
+ ReplicationSlot *s;
+ ReplicationSlot *slot;
+ char sync_state = '\0';

In my previous review [1]#33a I thought it was strange to assign the
sync_state (which is essentially an enum) to some meaningless value,
so I suggested it should be set to SYNCSLOT_STATE_NONE in the
declaration. The reply [2] was "No, that will change the flow. It
should stay uninitialized if the slot is not found."

But I am not convinced there is any flow problem. Also,
SYNCSLOT_STATE_NONE seems the naturally correct default for something
with no state. It cannot be found and be SYNCSLOT_STATE_NONE at the
same time (that is reported as an ERROR "skipping sync of slot") so I
see no problem.

The CURRENT code is like this:

/* Slot created by the slot sync worker exists, sync it */
if (sync_state)
{
Assert(sync_state == SYNCSLOT_STATE_READY || sync_state ==
SYNCSLOT_STATE_INITIATED);
...
}
/* Otherwise create the slot first. */
else
{
...
}

AFAICT that could easily be changed to like below, with no change to
the logic, and it avoids setting strange values.

SUGGESTION.

if (sync_state == SYNCSLOT_STATE_NONE)
{
/* Slot not found. Create it. */
..
}
else
{
Assert(sync_state == SYNCSLOT_STATE_READY || sync_state ==
SYNCSLOT_STATE_INITIATED);
...
}

~~~

13. synchronize_one_slot

+static void
+synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot,
+ bool *slot_updated)

This *slot_updated parameter looks dubious. It is used in a loop from
the caller to mean that ANY slot was updated -- e.g. maybe it is true
or false on entry to this function.

But, Instead of having some dependency between this function and the
caller, IMO it makes more sense if we would make this just a boolean
function in the first place (e.g. was updated? T/F)

Then the caller can also be written more easily like:

some_slot_updated |= synchronize_one_slot(wrconn, remote_slot);

~~~

14.
+ /* Search for the named slot */
+ if ((s = SearchNamedReplicationSlot(remote_slot->name, true)))
+ {
+ SpinLockAcquire(&s->mutex);
+ sync_state = s->data.sync_state;
+ SpinLockRelease(&s->mutex);
+
+ /* User created slot with the same name exists, raise ERROR. */
+ if (sync_state == SYNCSLOT_STATE_NONE)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("skipping sync of slot \"%s\" as it is a user created"
+ " slot", remote_slot->name),
+ errdetail("This slot has failover enabled on the primary and"
+ " thus is sync candidate but user created slot with"
+ " the same name already exists on the standby")));
+ }
+ }

Extra curly brackets around the ereport are not needed.

~~~

15.
+ /*
+ * Sanity check: With hot_standby_feedback enabled and
+ * invalidations handled apropriately as above, this should never
+ * happen.
+ */
+ if (remote_slot->restart_lsn < slot->data.restart_lsn)
+ {
+ elog(ERROR,
+ "not synchronizing local slot \"%s\" LSN(%X/%X)"
+ " to remote slot's LSN(%X/%X) as synchronization "
+ " would move it backwards", remote_slot->name,
+ LSN_FORMAT_ARGS(slot->data.restart_lsn),
+ LSN_FORMAT_ARGS(remote_slot->restart_lsn));
+ }

15a.
/apropriately/appropriately/

~

15b.
Extra curly brackets around the elog are not needed.

~~~

16. synchronize_slots

+static bool
+synchronize_slots(WalReceiverConn *wrconn)
+{
+#define SLOTSYNC_COLUMN_COUNT 9
+ Oid slotRow[SLOTSYNC_COLUMN_COUNT] = {TEXTOID, TEXTOID, LSNOID,
+ LSNOID, XIDOID, BOOLOID, BOOLOID, TEXTOID, INT2OID};
+
+ WalRcvExecResult *res;
+ TupleTableSlot *tupslot;
+ StringInfoData s;
+ List *remote_slot_list = NIL;
+ MemoryContext oldctx = CurrentMemoryContext;
+ ListCell *lc;
+ bool slot_updated = false;

Suggest renaming 'slot_updated' to 'some_slot_updated' or
'update_occurred' etc because the current name makes it look like it
applies to a single slot, but it doesn't.

~~~

17.
+ SpinLockAcquire(&WalRcv->mutex);
+ if (!WalRcv ||
+ (WalRcv->slotname[0] == '\0') ||
+ XLogRecPtrIsInvalid(WalRcv->latestWalEnd))
+ {
+ SpinLockRelease(&WalRcv->mutex);
+ return slot_updated;
+ }
+ SpinLockRelease(&WalRcv->mutex);

IMO "return false;" here is more clear than saying "return slot_updated;"

~~~

18.
+ appendStringInfo(&s,
+ "SELECT slot_name, plugin, confirmed_flush_lsn,"
+ " restart_lsn, catalog_xmin, two_phase, failover,"
+ " database, pg_get_slot_invalidation_cause(slot_name)"
+ " FROM pg_catalog.pg_replication_slots"
+ " WHERE failover and sync_state != 'i'");

18a.
/and/AND/

~

18b.
In the reply post (see [2]#32) Shveta said "I could not find quote_*
function for a character just like we have 'quote_literal_cstr' for
string". If you still want to use constant substitution instead of
just hardwired 'i' then why do even you need a quote_* function? I
thought the appendStringInfo uses a printf style format-string
internally, so I assumed it is possible to substitute the state char
directly using '%c'.

~~~

19.
+
+
+
+ /* We are done, free remote_slot_list elements */
+ list_free_deep(remote_slot_list);
+
+ walrcv_clear_result(res);
+
+ return slot_updated;
+}

Excessive blank lines.

~~~

20. validate_primary_slot

+ appendStringInfo(&cmd,
+ "SELECT count(*) = 1 from pg_replication_slots "
+ "WHERE slot_type='physical' and slot_name=%s",
+ quote_literal_cstr(PrimarySlotName));

/and/AND/

~~~

21.
+ slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
+ tuple_ok = tuplestore_gettupleslot(res->tuplestore, true, false, slot);
+ Assert(tuple_ok); /* It must return one tuple */

IMO it's better to use all the var names the same across all
functions? So call this 'tupslot' like the other
MakeSingleTupleTableSlot result.

~~~

22. validate_slotsync_parameters

+/*
+ * Checks if GUCs are set appropriately before starting slot sync worker
+ *
+ * The slot sync worker can not start if 'enable_syncslot' is off and
+ * since 'enable_syncslot' is ON, check that the other GUC settings
+ * (primary_slot_name, hot_standby_feedback, wal_level, primary_conninfo)
+ * are compatible with slot synchronization. If not, raise ERROR.
+ */
+static void
+validate_slotsync_parameters(char **dbname)
+{

22a.
The comment is quite verbose. IMO the 2nd para seems just unnecessary
detail of the 1st para.

SUGGESTION
Check that all necessary GUCs for slot synchronization are set
appropriately. If not, raise an ERROR.

~~~

22b.
IMO (and given what was said in the comment about enable_syncslot must
be on)the first statement of this function should be:

/* Sanity check. */
Assert(enable_syncslot);

~~~

23. slotsync_reread_config

+ old_dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo);
+ Assert(old_dbname);

(This is same comment as old review [1]#61)

Hmm. I still don't see why this extraction of the dbname cannot be
deferred until later when you know the PrimaryConnInfo has changed,
otherwise, it might be redundant to do this. Shveta replied [2] that
"Once PrimaryConnInfo is changed, we can not get old-dbname.", but I'm
not so sure. Isn't this walrcv_get_dbname_from_conninfo just doing a
string search -- Why can't you defer this until you know
conninfoChanged is true, and then to get the old_dbname, you can just
pass the old_primary_conninfo. E.g. call like
walrcv_get_dbname_from_conninfo(old_primary_conninfo); Maybe I am
mistaken.

~~

24.
+ /*
+ * Since we have initialized this worker with the old dbname, thus
+ * exit if dbname changed. Let it get restarted and connect to the new
+ * dbname specified.
+ */
+ if (conninfoChanged && strcmp(old_dbname, new_dbname) != 0)
+ ereport(ERROR,
+ errmsg("exiting slot sync worker as dbname in "
+ "primary_conninfo changed"));

IIUC when the tablesync has to restart, it emits a LOG message before
it exits; but it's not an ERROR. So, shouldn't this be similar -- IMO
it is not an "error" for the user to wish to change the dbname. Maybe
this should be LOG followed by an explicit exit. If you agree, then it
might be better to encapsulate such logic in some little function:

// pseudo-code
void slotsync_worker_restart(const char *msg)
{
ereport(LOG, msg...
exit(0);
}

~~~

25. ReplSlotSyncWorkerMain

+ for (;;)
+ {
+ int rc;
+ long naptime = WORKER_DEFAULT_NAPTIME_MS;
+ TimestampTz now;
+ bool slot_updated;
+
+ ProcessSlotSyncInterrupts(wrconn);
+
+ slot_updated = synchronize_slots(wrconn);

Here I think the 'slot_updated' should be renamed to the same name as
in #16 above (e.g. 'some_slot_updated' or 'any_slot_updated' or
'update_occurred' etc).

~~~

26. SlotSyncWorkerRegister

+ if (!enable_syncslot)
+ {
+ ereport(LOG,
+ errmsg("skipping slots synchronization because enable_syncslot is "
+ "disabled."));
+ return;
+ }

Instead of saying "because..." in the error message maybe keep the
message more terse and describe the "because" part in the errdetail

SUGGESTION
errmsg("skipping slot synchronization")
errdetail("enable_syncslot is disabled.")

======
src/backend/replication/slot.c

27.
+ * sync_state: Defines slot synchronization state. For user created slots, it
+ * is SYNCSLOT_STATE_NONE and for the slots being synchronized on the physical
+ * standby, it is either SYNCSLOT_STATE_INITIATED or SYNCSLOT_STATE_READY
*/
void
ReplicationSlotCreate(const char *name, bool db_specific,
ReplicationSlotPersistency persistency,
- bool two_phase, bool failover)
+ bool two_phase, bool failover, char sync_state)

27a.
Why is this comment even mentioning SYNCSLOT_STATE_READY? IIUC it
doesn't make sense to ever call ReplicationSlotCreate directly setting
the 'r' state (e.g., bypassing 'i' ???)

~

27b.
Indeed, IMO there should be Assert(sync_state == SYNCSLOT_STATE_NONE
|| syncstate == SYNCSLOT_STATE_INITIATED); to guarantee this.

======
src/include/replication/slot.h

28.
+ /*
+ * Is this a slot created by a sync-slot worker?
+ *
+ * Only relevant for logical slots on the physical standby.
+ */
+ char sync_state;
+

(probably I am repeating a previous thought here)

The comment says the field is only relevant for standby, and that's
how I've been visualizing it, and why I had previously suggested even
renaming it to 'standby_sync_state'. However, replies are saying that
after failover these sync_states also have "some meaning for the
primary server".

That's the part I have trouble understanding. IIUC the server states
are just either all 'n' (means nothing) or 'r' because they are just
leftover from the old standby state. So, does it *truly* have meaning
for the server? Or should those states somehow be removed/ignored on
the new primary? Anyway, the point is that if this field does have
meaning also on the primary (I doubt) then those details should be in
this comment. Otherwise "Only relevant ... on the standby" is too
misleading.

======
.../t/050_standby_failover_slots_sync.pl

29.
+# Create table and publication on primary
+$primary->safe_psql('postgres', "CREATE TABLE tab_mypub3 (a int
PRIMARY KEY);");
+$primary->safe_psql('postgres', "CREATE PUBLICATION mypub3 FOR TABLE
tab_mypub3;");
+

29a.
/on primary/on the primary/

~

29b.
Consider to combine those DDL

~

29c.
Perhaps for consistency, you should be calling this 'regress_mypub3'.

~~~

30.
+# Create a subscriber node
+my $subscriber3 = PostgreSQL::Test::Cluster->new('subscriber3');
+$subscriber3->init(allows_streaming => 'logical');
+$subscriber3->start;
+$subscriber3->safe_psql('postgres', "CREATE TABLE tab_mypub3 (a int
PRIMARY KEY);");
+
+# Create a subscription with failover = true & wait for sync to complete.
+$subscriber3->safe_psql('postgres',
+ "CREATE SUBSCRIPTION mysub3 CONNECTION '$publisher_connstr' "
+ . "PUBLICATION mypub3 WITH (slot_name = lsub3_slot, failover = true);");
+$subscriber3->wait_for_subscription_sync;

30a
Consider combining those DDLs.

~

30b.
Probably for consistency, you should be calling this 'regress_mysub3'.

~~~

31.
+# Advance lsn on the primary
+$primary->safe_psql('postgres',
+ "SELECT pg_log_standby_snapshot();");
+$primary->safe_psql('postgres',
+ "SELECT pg_log_standby_snapshot();");
+$primary->safe_psql('postgres',
+ "SELECT pg_log_standby_snapshot();");
+

Consider combining all those DDLs.

~~~

32.
+# Truncate table on primary
+$primary->safe_psql('postgres',
+ "TRUNCATE TABLE tab_mypub3;");
+
+# Insert data on the primary
+$primary->safe_psql('postgres',
+ "INSERT INTO tab_mypub3 SELECT generate_series(1, 10);");
+

Consider combining those DDLs.

~~~

33.
+# Confirm that restart_lsn of lsub3_slot slot is synced to the standby
+$result = $standby3->safe_psql('postgres',
+ qq[SELECT '$primary_lsn' >= restart_lsn from pg_replication_slots
WHERE slot_name = 'lsub3_slot';]);
+is($result, 't', 'restart_lsn of slot lsub3_slot synced to standby');

Does "'$primary_lsn' >= restart_lsn" make sense here? NOTE, the sign
was '<=' in v43-0002

~~~

34.
+# Confirm that confirmed_flush_lsn of lsub3_slot slot is synced to the standby
+$result = $standby3->safe_psql('postgres',
+ qq[SELECT '$primary_lsn' >= confirmed_flush_lsn from
pg_replication_slots WHERE slot_name = 'lsub3_slot';]);
+is($result, 't', 'confirmed_flush_lsn of slot lsub3_slot synced to
the standby');

Does "'$primary_lsn' >= confirmed_flush_lsn" make sense here? NOTE,
the sign was '<=' in v43-0002

~~~

35.
+##################################################
+# Test that synchronized slot can neither be docoded nor dropped by the user
+##################################################

35a.
/docoded/decoded/

~

35b.
Please give explanation in the comment *why* those ops are not allowed
(e.g. because the hot_standby_feedback GUC does not have an accepted
value)

~~~

36.
+##################################################
+# Create another slot which stays in sync_state as initiated ('i')
+##################################################
+

Please explain the comment as to *why* it gets stuck in the initiated state.

~~~

37.
+##################################################
+# Promote the standby3 to primary. Confirm that:
+# a) the sync-ready('r') slot 'lsub3_slot' is retained on new primary
+# b) the initiated('i') slot 'logical_slot'is dropped on promotion
+# c) logical replication for mysub3 is resumed succesfully after failover
+##################################################

/'logical_slot'is/'logical_slot' is/ (missing space)

/succesfully/successfully/

~~~

38.
+# Update subscription with new primary's connection info
+$subscriber3->safe_psql('postgres', "ALTER SUBSCRIPTION mysub3 DISABLE;");
+$subscriber3->safe_psql('postgres', "ALTER SUBSCRIPTION mysub3
CONNECTION '$standby3_conninfo';");
+$subscriber3->safe_psql('postgres', "ALTER SUBSCRIPTION mysub3 ENABLE;");

Consider combining all those DDLs.

~~~

39.
+
+# Insert data on the new primary
+$standby3->safe_psql('postgres',
+ "INSERT INTO tab_mypub3 SELECT generate_series(11, 20);");
+
+# Confirm that data in tab_mypub3 replicated on subscriber
+is( $subscriber3->safe_psql('postgres', q{SELECT count(*) FROM tab_mypub3;}),
+ "20",
+ 'data replicated from new primary');

Shouldn't there be some wait_for_subscription_sync logic (or similar)
here just to ensure the subscriber3 had time to receive that data
before you immediately check that it had arrived?

======
[1] My v43-0002 review.
https://www.postgresql.org/message-id/CAHut%2BPuuqEpDse5msENsVuK3rjTRN-QGS67rRCGVv%2BzcT-f0GA%40mail.gmail.com
[2] Replies to v43-0002 review.
https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY%2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Corey Huinker 2023-12-13 10:26:04 Re: Statistics Import and Export
Previous Message jian he 2023-12-13 09:59:16 Re: remaining sql/json patches