From: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
---|---|
To: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>, "duffieldzane(at)gmail(dot)com" <duffieldzane(at)gmail(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Subject: | Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
Date: | 2025-07-23 04:35:01 |
Message-ID: | CANhcyEUs+_fgmd61jWiSvwxYz+-DGgL00q=C5ZdoYaj9D9baWw@mail.gmail.com |
Lists: | pgsql-bugs |
On Tue, 22 Jul 2025 at 17:51, Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Shlok,
>
> > I checked it and here is my analysis:
> >
> > When we create a slot, it returns the confirmed_flush LSN as a
> > consistent_lsn. I noticed that in general when we create a slot, the
> > confirmed_flush is set to the end of a RUNNING_XACT log or we can say
> > start of the next record. And this next record can be anything. It can
> > be a COMMIT record for a transaction in another session.
> > I have attached server logs and waldump logs for one of such case
> > reproduced using test script shared in [1].
> > The snapbuild machinery has four steps: START, BUILDING_SNAPSHOT,
> > FULL_SNAPSHOT and SNAPBUILD_CONSISTENT. Between each step a
> > RUNNING_XACT is logged.
> ...
>
> Thanks for the analysis! It is quite helpful. Based on your points, my
> understanding is as below. Is it correct?
>
> Facts:
> =====
> 1.
> RUNNING_XACT records can be generated when the snapshot status is advanced while
> creating the slot.
> 2.
> pg_create_logical_replication_slot() returns the end point of RUNNING_XACT.
> It was generated when the snapshot becomes SNAPBUILD_CONSISTENT.
> 3.
> Some transactions could be started while the snapshot is in the FULL_SNAPSHOT
> state, and they can be committed after we reach SNAPBUILD_CONSISTENT. Such
> transactions should be output by the upcoming logical decoding.
>
> What happened here:
> =================
> a.
> confirmed_flush_lsn was 0/03CBCCA0, which is the end of RUNNING_XACT (lsn: 0/03CBCC58).
> Also, a COMMIT record for txn 1369 was located *just after* the RUNNING_XACT [1].
> b.
> pg_createsubscriber set the recovery_target_lsn to "0/03CBCCA0", and
> recovery_target_inclusive was true. This meant the record starting at
> "0/03CBCCA0" must be applied.
> c.
> The startup process applied WAL up to that point. Transaction 1369 was applied
> and then the standby could be promoted.
> d.
> The logical walsender decoded transaction 1369 and replicated it to the standby.
> However, it had already been applied by the startup process, so a conflict could occur.
>
> [1]:
> according to the log:
> ```
> ...
> rmgr: Standby len (rec/tot): 70/ 70, tx: 0, lsn: 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid 1370 latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369 1368 1367
> rmgr: Transaction len (rec/tot): 46/ 46, tx: 1369, lsn: 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-20 16:50:18.031146 IST
> ...
> ```
>
> Best regards,
> Hayato Kuroda
> FUJITSU LIMITED
>
Hi Kuroda-san,
Thanks for reviewing the thread. Your understanding is correct.
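For readers following the thread, the overlap in steps a. to d. can be illustrated with a toy model. This is only a sketch, not PostgreSQL code: LSNs are plain integers, the WAL is a two-element list taken from the waldump excerpt above, and the helper names replayed_by_recovery and sent_by_decoding are made up for illustration. It assumes that an inclusive recovery target replays the record starting at the target LSN, and that decoding sends every commit whose record starts at or after confirmed_flush_lsn.

```python
# Toy model only: LSNs are plain integers and WAL records are tuples.
RUNNING_XACTS_LSN = 0x03CBCC58
COMMIT_1369_LSN = 0x03CBCCA0  # also the end of the RUNNING_XACTS record

# (start LSN, record kind, xid) for the two records in the waldump excerpt.
wal = [
    (RUNNING_XACTS_LSN, "RUNNING_XACTS", None),
    (COMMIT_1369_LSN, "COMMIT", 1369),
]


def replayed_by_recovery(wal, target_lsn, inclusive):
    """Xids whose COMMIT record is replayed before recovery stops (hypothetical helper)."""
    return {
        xid
        for lsn, kind, xid in wal
        if kind == "COMMIT"
        and (lsn <= target_lsn if inclusive else lsn < target_lsn)
    }


def sent_by_decoding(wal, confirmed_flush_lsn):
    """Xids whose COMMIT record starts at or after confirmed_flush_lsn (hypothetical helper)."""
    return {
        xid
        for lsn, kind, xid in wal
        if kind == "COMMIT" and lsn >= confirmed_flush_lsn
    }


# The slot's confirmed_flush_lsn is used as the recovery target.
consistent_lsn = COMMIT_1369_LSN

applied_twice = replayed_by_recovery(wal, consistent_lsn, True) & sent_by_decoding(
    wal, consistent_lsn
)
print(applied_twice)
```

In this model, transaction 1369 lands in both sets when the target is inclusive, which is the duplicate apply described above. With inclusive set to False the intersection is empty, but that is an observation about the toy model only.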
Thanks,
Shlok Kyal