Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-09-27 12:03:19
Message-ID: CAJpy0uDD+9aJnDx9fBfvLvxJtxA7qqoAys4fo6h1tq1b_0_A7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Sep 25, 2023 at 7:46 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Ajin, Shveta,
>
> Thank you for rebasing the patch set! Here are new comments for v19_2-0001.
>

Thank You Kuroda-san for the feedback. Most of these are addressed in
v20. Please find my response inline.

> 01. WalSndWaitForStandbyNeeded()
>
> ```
> if (SlotIsPhysical(MyReplicationSlot))
> return false;
> ```
>
> Is there a possibility that physical walsenders call this function?
> IIUC following is a stacktrace for the function, so the only logical walsenders use it.
> If so, it should be Assert() instead of an if statement.
>
> logical_read_xlog_page()
> WalSndWaitForWal()
> WalSndWaitForStandbyNeeded()

It will only be called from logical-walsenders. Modified as you suggested.

>
> 02. WalSndWaitForStandbyNeeded()
>
> Can we set shouldwait in SlotSyncInitConfig()? synchronize_slot_names_list is
> searched whenever the function is called, but it is not changed automatically.
> If the slotname is compared with the list in the SlotSyncInitConfig(), the
> liner search can be reduced.

standby_slot_names and synchronize_slot_names will be removed in the
next version as per discussion in [1] and thus SlotSyncInitConfig()
will not be needed. It will be replaced by new functionality. So I am
currently leaving it as is.

>
> 03. WalSndWaitForStandbyConfirmation()
>
> We should add ProcessRepliesIfAny() during the loop, otherwise the walsender
> overlooks the death of an apply worker.
>

Done.

> 04. WalSndWaitForStandbyConfirmation()
>
> Not sure, but do we have to return early if walsenders got PROCSIG_WALSND_INIT_STOPPING
> signal? I thought that if physical walsenders get stuck, logical walsenders wait
> forever. At that time we cannot stop the primary server even if "pg_ctl stop"
> is executed.
>

yes, right. I have added CHECK_FOR_INTERRUPTS() and 'got_STOPPING'
handling now which I think should suffice to process
PROCSIG_WALSND_INIT_STOPPING.

> 05. SlotSyncInitConfig()
>
> Why don't we free the memory for rawname, old standby_slot_names_list, and synchronize_slot_names_list?
> They seem to be overwritten.
>

Skipped for the time being due to reasons stated in pt 2.

> 06. SlotSyncInitConfig()
>
> Both physical and logical walsenders call the func, but physical one do not use
> lists, right? If so, can we add a quick exit for physical walsenders?
> Or, we should carefully remove where physical calls it.
>
> 07. StartReplication()
>
> I think we do not have to call SlotSyncInitConfig().
> Alternative approach is written in above.
>

I have removed it from StartReplication()

> 08. the other
>
> Also, I found the unexpected behavior after both 0001 and 0002 were applied.
> Was it normal or not?
>
> 1. constructed below setup
> (ensured that logical slot existed on secondary)
> 2. stopped the primary
> 3. promoted the secondary server
> 4. disabled a subscription once
> 5. changed the connection string for subscriber
> 6. Inserted data to new primary
> 7. enabled the subscription again
> 8. got an ERROR: replication slot "sub" does not exist
>
> I expected that the logical replication would be restarted, but it could not.
> Was it real issue or my fault? The error would appear in secondary.log.
>
> ```
> Setup:
> primary--->secondary
> |
> |
> subscriber
> ```

I have attached the new test-script (v2), can you please try that on
the v20 set of patches. We should let the slot creation complete first
on standby and then try promotion. I have added a few extra lines in
v2 of your script for the same. In the test-case, primary's
restart-lsn was lagging behind
new-slot's restart_lsn on standby and thus standby was waiting for
primary to catch-up. Meanwhile standby got promoted and thus slot
creation got aborted. That is the reason you were not able to get the
logical replication working on the new primary. v20 has improved
handling and better logging for such a case. Please try the attached
test-script on v20.

[1]: https://www.postgresql.org/message-id/CAJpy0uA%2Bt3XP2M0qtEmrOG1gSwHghjHPno5AtwTXM-94-%2Bc6JQ%40mail.gmail.com

thanks
Shveta

Attachment Content-Type Size
test_0925_v2.sh text/x-sh 3.1 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Pyhalov 2023-09-27 12:36:34 Re: Partial aggregates pushdown
Previous Message Étienne BERSAC 2023-09-27 11:57:50 Re: logfmt and application_context