Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Peter Smith <smithpb2250(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-10-04 05:54:20
Message-ID: CAJpy0uCALOpGY+9vdD=qaCgoPOeLfGpDzVOt3qEJHCnpQ9SqWw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 2, 2023 at 4:29 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Shveta,
>
> Thank you for updating the patch!
>
> I found another ERROR due to the slot removal. Is this a real issue?
>
> 1. applied add_sleep.txt, which emulated the case the tablesync worker stucked
> and the primary crashed during the
> initial sync.
> 2. executed test_0925_v2.sh (You attached in [1])
> 3. secondary could not start the logical replication because the slot was not
> created (log files were also attached).
>
>
> Here is my analysis. The cause is that the slotsync worker aborts the slot creation
> on secondary server because the restart_lsn of secondary ahead the primary's one.
> IIUC it can be occurred when tablesync workers finishes initial copy before
> walsenders stream changes. In this case, the relstate of the worker is set to
> SUBREL_STATE_CATCHUP and the apply worker waits till the relation becomes
> SUBREL_STATE_SYNCDONE. From here the slot on primary will not be updated until
> the relation is caught up. If some changes are come and the primary crashes at
> that time, the syncslot worker will abort the slot creation.
>

Kuroda-San, we need to let slot-creation on standby finish before we
start expecting it to support logical replication on failover. In the
current case, as you stated the slot-creation itself is aborted and
thus it can not support logical-replication later. We are currently
trying to think of possibilities to advance remote_lsn on primary
internally by slot-sync workers in order to accelerate slot-creation
on standby for cases where slot-creation is stuck due to primary's
restart_lsn lagging behind standby's restart_lsn. But till then, the
way to proceed for testing is to execute workload on primary for such
cases in order to accelerate slot-creation.

thanks
Shveta

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2023-10-04 06:03:28 Re: Row pattern recognition
Previous Message shveta malik 2023-10-04 05:00:48 Re: Synchronizing slots from primary to standby