RE: Synchronizing slots from primary to standby

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: RE: Synchronizing slots from primary to standby
Date: 2024-01-16 12:00:29
Message-ID: OS0PR01MB57166281EA512BAE9A2F324F94732@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tuesday, January 16, 2024 9:27 AM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Here are some review comments for patch v61-0002

Thanks for the comments.

>
> ======
> doc/src/sgml/logical-replication.sgml
>
> 1.
> + <sect2 id="logical-replication-failover-examples">
> + <title>Examples: logical replication failover</title>
>
> The current documentation structure (after the patch is applied) looks
> like this:
>
> 30.1. Publication
> 30.2. Subscription
> 30.2.1. Replication Slot Management
> 30.2.2. Examples: Set Up Logical Replication
> 30.2.3. Examples: Deferred Replication Slot Creation
> 30.2.4. Examples: logical replication failover
>
> I don't think it is ideal.
>
> Firstly, I think this new section is not just "Examples:"; it is more
> like instructions for steps to check if a successful failover is
> possible. IMO call it something like "Logical Replication Failover" or
> "Replication Slot Failover".
>
> Secondly, I don't think this new section strictly belongs underneath
> the "Subscription" section anymore because IMO it is just as much
> about the promotion of the publications. Now that you are adding this
> new (2nd) section about slots, I think the whole structure of this
> document should be changed like below:
>
> SUGGESTION #1 (make a new section 30.3 just for slot-related topics)
>
> 30.1. Publication
> 30.2. Subscription
> 30.2.1. Examples: Set Up Logical Replication
> 30.3. Logical Replication Slots
> 30.3.1. Replication Slot Management
> 30.3.2. Examples: Deferred Replication Slot Creation
> 30.3.3. Logical Replication Failover
>
> ~
>
> SUGGESTION #2 (keep the existing structure, but give the failover its
> own new section 30.3)
>
> 30.1. Publication
> 30.2. Subscription
> 30.2.1. Replication Slot Management
> 30.2.2. Examples: Set Up Logical Replication
> 30.2.3. Examples: Deferred Replication Slot Creation
> 30.3 Logical Replication Failover

I used this version for now as I am sure about changing other section.

>
> ~
>
> SUGGESTION #2a (and maybe later you can extract some of the failover
> examples further)
>
> 30.1. Publication
> 30.2. Subscription
> 30.2.1. Replication Slot Management
> 30.2.2. Examples: Set Up Logical Replication
> 30.2.3. Examples: Deferred Replication Slot Creation
> 30.3 Logical Replication Failover
> 30.3.1. Examples: Checking if failover ready
>
> ~~~
>
> 2.
> + <para>
> + In a logical replication setup, if the publisher server is also the primary
> + server of the streaming replication, the logical slots on the
> primary server
> + can be synchronized to the standby server by specifying
> <literal>failover = true</literal>
> + when creating the subscription. Enabling failover ensures a seamless
> + transition of the subscription to the promoted standby, allowing it to
> + subscribe to the new primary server without any data loss.
> + </para>
>
> I was initially confused by the wording. How about like below:
>
> SUGGESTION
> When the publisher server is the primary server of a streaming
> replication, the logical slots on that primary server can be
> synchronized to the standby server by specifying <literal>failover =
> true</literal> when creating subscriptions for those publications.
> Enabling failover ensures a seamless transition of those subscriptions
> after the standby is promoted. They can continue subscribing to
> publications now on the new primary server without any data loss.

Changed as suggested.

>
> ~~~
>
> 3.
> + <para>
> + However, the replication slots are copied asynchronously, which
> means it's necessary
> + to confirm that replication slots have been synced to the standby server
> + before the failover happens. Additionally, to ensure a successful failover,
> + the standby server must not lag behind the subscriber. To confirm
> + that the standby server is ready for failover, follow these steps:
> + </para>
>
> Minor rewording
>
> SUGGESTION
> Because the slot synchronization logic copies asynchronously, it is
> necessary to confirm that replication slots have been synced to the
> standby server before the failover happens. Furthermore, to ensure a
> successful failover, the standby server must not be lagging behind the
> subscriber. To confirm that the standby server is indeed ready for
> failover, follow these 2 steps:

Changed as suggested.

>
> ~~~
>
> 4.
> The instructions said "follow these steps", so the next parts should
> be rendered as 2 "steps" (using <procedure> markup?)
>
> SUGGESTION (show as steps 1,2 and also some minor rewording of the
> step heading)
>
> 1. Confirm that all the necessary logical replication slots have been
> synced to the standby server.
> 2. Confirm that the standby server is not lagging behind the subscribers.
>

Changed as suggested.

> ~~~
>
> 5.
> + <para>
> + Check if all the necessary logical replication slots have been synced to
> + the standby server.
> + </para>
>
> SUGGESTION
> Confirm that all the necessary logical replication slots have been
> synced to the standby server.
>

Changed as suggested.

> ~~~
>
> 6.
> + <listitem>
> + <para>
> + On logical subscriber, fetch the slot names that should be synced to
> the
> + standby that we plan to promote.
>
> SUGGESTION
> Firstly, on the subscriber node, use the following SQL to identify the
> slot names that should be...
>

Changed as suggested.

> ~~~
>
> 7.
> +<programlisting>
> +test_sub=# SELECT
> + array_agg(slotname) AS slots
> + FROM
> + ((
> + SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid ||
> '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname
> + FROM pg_control_system() ctl, pg_subscription_rel r,
> pg_subscription s
> + WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND
> s.subfailover
> + ) UNION (
> + SELECT oid AS subid, subslotname as slotname
> + FROM pg_subscription
> + WHERE subfailover
> + ));
>
> 7a
> Maybe this ought to include "pg_catalog" schemas?

After searching other query examples, I think most of them don’t add this for
either function or system table. So, I didn’t add this.

>
> ~
>
> 7b.
> For consistency, maybe it is better to use a table alias "FROM
> pg_subscription s" in the UNION also

Added.

>
> ~~~
>
> 8.
> + <listitem>
> + <para>
> + Check that the logical replication slots exist on the standby server.
>
> SUGGESTION
> Next, check that the logical replication slots identified above exist
> on the standby server.

Changed as suggested.

>
> ~~~
>
> 9.
> +<programlisting>
> +test_standby=# SELECT bool_and(synced AND NOT temporary AND
> conflict_reason IS NULL) AS failover_ready
> + FROM pg_replication_slots
> + WHERE slot_name in ('slots');
> + failover_ready
> +----------------
> + t
>
> 9a.
> Maybe this ought to include "pg_catalog" schemas?

Same as above.

>
> ~
>
> 9b.
> IIUC that 'slots' reference is supposed to be those names that were
> found in the prior step. If so, then that point needs to be made
> clear, and anyway in this case 'slots' is not compatible with the
> 'sub' name returned by your first SQL.

Changed as suggested.

>
> ~~~
>
> 10.
> + <listitem>
> + <para>
> + Query the last replayed WAL on the logical subscriber.
>
> SUGGESTION
> Firstly, on the subscriber node check the last replayed WAL.
>
Changed as suggested.

> ~~~
>
> 11.
> +<programlisting>
> +test_sub=# SELECT
> + MAX(remote_lsn) AS remote_lsn_on_subscriber
> + FROM
> + ((
> + SELECT (CASE WHEN r.srsubstate = 'f' THEN
> pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' ||
> r.srrelid), false)
> + WHEN r.srsubstate = 's' THEN r.srsublsn
> END) as remote_lsn
> + FROM pg_subscription_rel r, pg_subscription s
> + WHERE r.srsubstate IN ('f', 's') AND s.oid = r.srsubid
> AND s.subfailover
> + ) UNION (
> + SELECT pg_replication_origin_progress(CONCAT('pg_' ||
> s.oid), false) AS remote_lsn
> + FROM pg_subscription s
> + WHERE subfailover
> + ));
>
> 11a.
> Maybe this ought to include "pg_catalog" schemas?

Same as above.

>
> ~
>
> 11b.
> /WHERE subfailover/WHERE s.subfailover/
>
> ~~~
>
> 12.
> + <listitem>
> + <para>
> + On the standby server, check that the last-received WAL location
> + is ahead of the replayed WAL location on the subscriber.
>
> SUGGESTION
> Next, on the standby server check that the last-received WAL location
> is ahead of the replayed WAL location on the subscriber identified
> above.
>

Changed as suggested.

> ~~~
>
> 13.
> +</programlisting></para>
> + </listitem>
> + <listitem>
> + <para>
> + On the standby server, check that the last-received WAL location
> + is ahead of the replayed WAL location on the subscriber.
> +<programlisting>
> +test_standby=# SELECT pg_last_wal_receive_lsn() >=
> 'remote_lsn_on_subscriber'::pg_lsn AS failover_ready;
> + failover_ready
> +----------------
> + t
>
> IIUC the 'remote_lsn_on_subscriber' is supposed to represent the
> substitution of the value found in the subscriber server. In this
> example maybe it would be:
> SELECT pg_last_wal_receive_lsn() >= '0/3000388'::pg_lsn AS failover_ready;
>
> maybe that point can be made more clearly.

I have changed it to use the actual LSN got in last step.

>
> ~~~
>
> 14.
> + <para>
> + If the result (failover_ready) of both above steps is true, it means it is
> + okay to subscribe to the standby server.
> + </para>
>
> 14a.
> failover_ready should be rendered as literal.

Added.

>
> ~
>
> 14b.
> Does this say what you intended, or did you mean something more like
> "the standby can be promoted and existing subscriptions will be able
> to continue without data loss"

I used the later part of your suggestion as I think promotion
depends not only on logical replication part.

Best Regards,
Hou zj

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message shveta malik 2024-01-16 12:02:20 Re: Synchronizing slots from primary to standby
Previous Message shveta malik 2024-01-16 11:57:05 Re: Synchronizing slots from primary to standby