RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-01-09 12:15:46
Message-ID: OS0PR01MB571663BCE8B28597D462FADE946A2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tuesday, January 9, 2024 9:17 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Here are some review comments for patch v57-0001.

Thanks for the comments!

>
> ======
> doc/src/sgml/protocol.sgml
>
> 1. CREATE_REPLICATION_SLOT ... FAILOVER
>
> + <varlistentry>
> + <term><literal>FAILOVER [ <replaceable
> class="parameter">boolean</replaceable> ]</literal></term>
> + <listitem>
> + <para>
> + If true, the slot is enabled to be synced to the physical
> + standbys so that logical replication can be resumed after failover.
> + </para>
> + </listitem>
> + </varlistentry>
>
> This syntax says passing the boolean value is optional. So the default needs to
> the specified here in the docs (like what the TWO_PHASE option does).
>
> ~~~
>
> 2. ALTER_REPLICATION_SLOT ... FAILOVER
>
> + <variablelist>
> + <varlistentry>
> + <term><literal>FAILOVER [ <replaceable
> class="parameter">boolean</replaceable> ]</literal></term>
> + <listitem>
> + <para>
> + If true, the slot is enabled to be synced to the physical
> + standbys so that logical replication can be resumed after failover.
> + </para>
> + </listitem>
> + </varlistentry>
> + </variablelist>
>
> This syntax says passing the boolean value is optional. So it needs to be
> specified here in the docs that not passing a value would be the same as
> passing the value true.

The behavior that "not passing a value would be the same as passing the value
true " is due to the rule of defGetBoolean(). And all the options of commands
in this document behave the same in this case, therefore I think we'd better
add document for it in a general place in a separate patch/thread instead of
mentioning this in each option's paragraph.

>
> ======
> doc/src/sgml/ref/alter_subscription.sgml
>
> 3.
> + If <link
> linkend="sql-createsubscription-params-with-failover"><literal>failover</lit
> eral></link>
> + is enabled, you can temporarily disable it in order to execute
> these commands.
>
> /in order to/to/

This part has been removed due to design change.

>
> ~~~
>
> 4.
> + <para>
> + When altering the
> + <link
> linkend="sql-createsubscription-params-with-slot-name"><literal>slot_nam
> e</literal></link>,
> + the <literal>failover</literal> property of the new slot may
> differ from the
> + <link
> linkend="sql-createsubscription-params-with-failover"><literal>failover</lit
> eral></link>
> + parameter specified in the subscription. When creating the slot,
> + ensure the slot failover property matches the
> + <link
> linkend="sql-createsubscription-params-with-failover"><literal>failover</lit
> eral></link>
> + parameter value of the subscription.
> + </para>
>
> 4a.
> the <literal>failover</literal> property of the new slot may differ
>
> Maybe it would be more clear if that said "the failover property value of the
> named slot...".

Changed.

>
> ~
>
> 4b.
> In the "failover property matches" part should that failover also be rendered as
> <literal> like before in the same paragraph?

Added.

>
> ======
> doc/src/sgml/system-views.sgml
>
> 5.
> + <row>
> + <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>failover</structfield> <type>bool</type>
> + </para>
> + <para>
> + True if this logical slot is enabled to be synced to the physical
> + standbys so that logical replication can be resumed from the new
> primary
> + after failover. Always false for physical slots.
> + </para></entry>
> + </row>
>
> /True if this logical slot is enabled.../True if this is a logical slot enabled.../

Changed.

>
> ======
> src/backend/commands/subscriptioncmds.c
>
> 6. CreateSubscription
>
> + /*
> + * Even if failover is set, don't create the slot with failover
> + * enabled. Will enable it once all the tables are synced and
> + * ready. The intention is that if failover happens at the time of
> + * table-sync, user should re-launch the subscription instead of
> + * relying on main slot (if synced) with no table-sync data
> + * present. When the subscription has no tables, leave failover as
> + * false to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
> + * work.
> + */
> + if (opts.failover && !opts.copy_data && tables != NIL)
> + failover_enabled = true;
>
> AFAICT it might be possible for this to set failover_enabled = true if copy_data
> is false. So failover_enabled would be true when later
> calling:
> walrcv_create_slot(wrconn, opts.slot_name, false, twophase_enabled,
> failover_enabled, CRS_NOEXPORT_SNAPSHOT, NULL);
>
> Isn't that contrary to what this comment said: "Even if failover is set, don't
> create the slot with failover enabled"

This part has been removed due to design change.

>
> ~~~
>
> 7. AlterSubscription. case ALTER_SUBSCRIPTION_OPTIONS:
>
> + if (!sub->slotname)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot set failover for subscription that does not have slot
> + name")));
>
> /for subscription that does not have slot name/for a subscription that does not
> have a slot name/

Changed.

>
> ======
> .../libpqwalreceiver/libpqwalreceiver.c
>
> 8.
> + if (PQresultStatus(res) != PGRES_COMMAND_OK) ereport(ERROR,
> + (errcode(ERRCODE_PROTOCOL_VIOLATION),
> + errmsg("could not alter replication slot \"%s\"", slotname)));
>
> This used to display the error message like
> pchomp(PQerrorMessage(conn->streamConn)) but it was removed. Is it OK?

I added this back.

>
> ======
> src/backend/replication/logical/tablesync.c
>
> 9.
> + if (MySubscription->twophasestate ==
> + LOGICALREP_TWOPHASE_STATE_PENDING)
> + ereport(LOG,
> + /* translator: %s is a subscription option */ (errmsg("logical
> + replication apply worker for subscription \"%s\"
> will restart so that %s can be enabled",
> + MySubscription->name, "two_phase")));
> +
> + if (MySubscription->failoverstate ==
> + LOGICALREP_FAILOVER_STATE_PENDING)
> + ereport(LOG,
> + /* translator: %s is a subscription option */ (errmsg("logical
> + replication apply worker for subscription \"%s\"
> will restart so that %s can be enabled",
> + MySubscription->name, "failover")));
>
> Those errors have multiple %s, so the translator's comment should say "the
> 2nd %s is a..."

This part has been removed due to design change.

>
> ~~~
>
> 10.
> void
> -UpdateTwoPhaseState(Oid suboid, char new_state)
> +EnableTwoPhaseFailoverTriState(Oid suboid, bool enable_twophase,
> + bool enable_failover)
>
> I felt the function name was a bit confusing. Maybe it is simpler to call it like
> "EnableTriState" or "EnableSubTriState" -- the parameters anyway specify what
> actual state(s) will be set.

This part has been removed due to design change.

>
> ======
> src/backend/replication/logical/worker.c
>
> 11.
> + /* Update twophase and/or failover */
> + EnableTwoPhaseFailoverTriState(MySubscription->oid, twophase_pending,
> + failover_pending);
> + if (twophase_pending)
> + MySubscription->twophasestate =
> LOGICALREP_TWOPHASE_STATE_ENABLED;
> +
> + if (failover_pending)
> + MySubscription->failoverstate = LOGICALREP_FAILOVER_STATE_ENABLED;
>
> Can't you pass the MySubscription as a parameter and then the
> EnableTwoPhaseFailoverTriState can also set these
> LOGICALREP_TWOPHASE_STATE_ENABLED/LOGICALREP_FAILOVER_STATE_EN
> ABLED
> states within the Enable* function?

This part has been removed due to design change.

>
> ======
> src/backend/replication/repl_gram.y
>
> 12.
> %token K_CREATE_REPLICATION_SLOT
> %token K_DROP_REPLICATION_SLOT
> +%token K_ALTER_REPLICATION_SLOT
>
> and
>
> + create_replication_slot drop_replication_slot alter_replication_slot
> + identify_system read_replication_slot timeline_history show
> + upload_manifest
>
> and
>
> | create_replication_slot
> | drop_replication_slot
> + | alter_replication_slot
>
> and
>
> | K_CREATE_REPLICATION_SLOT { $$ = "create_replication_slot"; }
> | K_DROP_REPLICATION_SLOT { $$ = "drop_replication_slot"; }
> + | K_ALTER_REPLICATION_SLOT { $$ = "alter_replication_slot"; }
>
> etc.
>
> ~
>
> Although it makes no difference IMO it is more natural to code everything in
> the order: create, alter, drop.
>
> ======
> src/backend/replication/repl_scanner.l
>
> 13.
> CREATE_REPLICATION_SLOT { return K_CREATE_REPLICATION_SLOT; }
> DROP_REPLICATION_SLOT { return K_DROP_REPLICATION_SLOT; }
> +ALTER_REPLICATION_SLOT { return K_ALTER_REPLICATION_SLOT; }
>
> and
>
> case K_CREATE_REPLICATION_SLOT:
> case K_DROP_REPLICATION_SLOT:
> + case K_ALTER_REPLICATION_SLOT:
>
> Although it makes no difference IMO it is more natural to code everything in
> the order: create, alter, drop.

Personally, I am not sure if it looks better, so I didn’t change this.

>
> ======
> src/backend/replication/slot.c
>
> 14.
> + if (SlotIsPhysical(MyReplicationSlot))
> + ereport(ERROR,
> + errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("cannot use %s with a physical replication slot",
> + "ALTER_REPLICATION_SLOT"));
>
> /with a/for a/

This is to be consistent with another error message, so I didn’t change this.

errmsg("cannot use %s with a logical replication slot",
"READ_REPLICATION_SLOT"));

>
> ======
> src/backend/replication/walsender.c
>
> 15.
> +static void
> +ParseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover)
> +{
> + ListCell *lc;
> + bool failover_given = false;
> +
> + /* Parse options */
> + foreach(lc, cmd->options)
> + {
> + DefElem *defel = (DefElem *) lfirst(lc);
>
> AFAIK there are some new-style macros now you can use for this code.
> e.g. foreach_ptr? See [1].

Changed.

>
> ~~~
>
> 16.
> + if (strcmp(defel->defname, "failover") == 0) { if (failover_given)
> + ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("conflicting or
> + redundant options"))); failover_given = true; *failover =
> + defGetBoolean(defel); }
>
> The documented syntax showed that passing the boolean value for the
> FAILOVER option is not mandatory. Does this code work if the boolean value is
> not passed?

It works, defGetBoolean will handle this case.

>
> ======
> src/bin/psql/tab-complete.c
>
> 17.
> I think "ALTER SUBSCRIPTION ... SET (failover)" is possible, but the ALTER
> SUBSCRIPTION tab completion code is missing.

Added.

> ======
> .../t/050_standby_failover_slots_sync.pl
>
> 19.
> +
> +# Copyright (c) 2023, PostgreSQL Global Development Group
> +
>
> /2023/2024/

Changed.

>
> ~~~
>
> 20.
> +# Create another subscription (using the same slot created above) that
> +enables # failover.
> +$subscriber1->safe_psql(
> + 'postgres', qq[
> + CREATE TABLE tab_int (a int PRIMARY KEY); CREATE SUBSCRIPTION
> +regress_mysub1 CONNECTION '$publisher_connstr'
> PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, copy_data=false,
> failover = true, create_slot = false);
>
> The comment should not say "Create another subscription" because this is the
> first subscription being created.
>
> /another/a/

Changed.

>
> ~~~
>
> 21.
> +##################################################
> +# Test if changing the failover property of a subscription updates the
> +# corresponding failover property of the slot.
> +##################################################
>
> /Test if/Test that/

Changed.

>
> ======
> src/test/regress/sql/subscription.sql
>
> 22.
> +CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false,
> failover = true);
> +
> +\dRs+
>
> This is currently only testing the explicit "failover=true".
>
> Maybe you can also test the other kinds work as expected:
> - explicit "SET (failover=false)"
> - explicit "SET (failover)" with no value specified

I think these tests don't add enough value to catch future bugs,
so I prefer not to add these.

Best Regards,
Hou zj

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-01-09 13:09:43 Re: Synchronizing slots from primary to standby
Previous Message Zhijie Hou (Fujitsu) 2024-01-09 12:14:04 RE: Synchronizing slots from primary to standby