From: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
---|---|
To: | 'Shlok Kyal' <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>, "duffieldzane(at)gmail(dot)com" <duffieldzane(at)gmail(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Subject: | RE: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load |
Date: | 2025-07-22 12:21:33 |
Message-ID: | OSCPR01MB14966F6D3C733B8581C718CFAF55CA@OSCPR01MB14966.jpnprd01.prod.outlook.com |
Lists: | pgsql-bugs |
Dear Shlok,
> I checked it and here is my analysis:
>
> When we create a slot, it returns the confirmed_flush LSN as a
> consistent_lsn. I noticed that in general when we create a slot, the
> confirmed_flush is set to the end of a RUNNING_XACT log or we can say
> start of the next record. And this next record can be anything. It can
> be a COMMIT record for a transaction in another session.
> I have attached server logs and waldump logs for one of such case
> reproduced using test script shared in [1].
> The snapbuild machinery has four steps: START, BUILDING_SNAPSHOT,
> FULL_SNAPSHOT and SNAPBUILD_CONSISTENT. Between each step a
> RUNNING_XACT is logged.
...
Thanks for the analysis! It is quite helpful. Based on your points, my
understanding is as follows. Is it correct?
Facts:
=====
1.
RUNNING_XACTS records can be generated each time the snapshot state advances
while the slot is being created.
2.
pg_create_logical_replication_slot() returns the end point of a RUNNING_XACTS
record, namely the one generated when the snapshot reaches SNAPBUILD_CONSISTENT.
3.
Some transactions can start while the snapshot is in the FULL_SNAPSHOT state and
commit after SNAPBUILD_CONSISTENT has been reached. Such transactions
should be output by the subsequent logical decoding.
What happened here:
=================
a.
confirmed_flush_lsn was 0/03CBCCA0, which is the end of the RUNNING_XACTS record
(lsn: 0/03CBCC58). Also, the COMMIT record for transaction 1369 was located
*just after* that RUNNING_XACTS record [1].
b.
pg_createsubscriber set recovery_target_lsn to "0/03CBCCA0", and
recovery_target_inclusive was true. This meant that the record starting at
"0/03CBCCA0" had to be applied.
c.
The startup process replayed WAL up to that point. Transaction 1369 was applied,
and then the standby was promoted.
d.
The logical walsender decoded transaction 1369 and replicated it to the promoted
standby. However, the startup process had already applied it, so a conflict
could occur.
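Steps a. to d. can be illustrated with a minimal Python sketch. It is a simplified model, not PostgreSQL code: it assumes that recovery compares the start LSN of each record against recovery_target_lsn, and that logical decoding sends every transaction whose COMMIT record starts at or after confirmed_flush_lsn. The names `CommitRecord`, `replayed_xids` and `decoded_xids` are hypothetical, and the only data used is the COMMIT record shown in [1].

```python
from dataclasses import dataclass
from typing import Iterable, Set


def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/03CBCCA0' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


@dataclass(frozen=True)
class CommitRecord:
    xid: int
    start_lsn: int  # LSN at which the COMMIT record begins


def replayed_xids(records: Iterable[CommitRecord], target_lsn: int,
                  inclusive: bool) -> Set[int]:
    """Model of recovery: replay records in LSN order until the target.

    Assumption: with inclusive=True the first record starting at or after
    the target is still applied; with inclusive=False replay stops before it.
    """
    applied: Set[int] = set()
    for rec in sorted(records, key=lambda r: r.start_lsn):
        reached_target = rec.start_lsn >= target_lsn
        if reached_target and not inclusive:
            break
        applied.add(rec.xid)
        if reached_target:
            break
    return applied


def decoded_xids(records: Iterable[CommitRecord],
                 confirmed_flush_lsn: int) -> Set[int]:
    """Model of decoding: send commits starting at or after confirmed_flush."""
    return {r.xid for r in records if r.start_lsn >= confirmed_flush_lsn}


# Values taken from the log excerpt in [1].
consistent_lsn = parse_lsn("0/03CBCCA0")
wal = [CommitRecord(xid=1369, start_lsn=parse_lsn("0/03CBCCA0"))]

# Transactions applied once by the startup process and again via decoding.
applied_twice = (replayed_xids(wal, consistent_lsn, inclusive=True)
                 & decoded_xids(wal, consistent_lsn))
print(sorted(applied_twice))  # prints [1369]
```

Under these assumptions, transaction 1369 ends up in both sets when recovery_target_inclusive is true, which matches the conflict described in the last step. Calling `replayed_xids` with `inclusive=False` returns an empty set for the same input.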
[1]:
According to the log:
```
...
rmgr: Standby len (rec/tot): 70/ 70, tx: 0, lsn: 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid 1370 latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369 1368 1367
rmgr: Transaction len (rec/tot): 46/ 46, tx: 1369, lsn: 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-20 16:50:18.031146 IST
...
```
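The relation between the two LSNs in this excerpt can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: WAL records start on 8-byte (MAXALIGN) boundaries, as on typical 64-bit builds, and the record does not cross a WAL page boundary. The helper names are hypothetical.

```python
MAXALIGN = 8  # assumption: 8-byte alignment, typical for 64-bit builds


def parse_lsn(text: str) -> int:
    """Convert an LSN such as '0/03CBCC58' into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn: int) -> str:
    """Format a 64-bit LSN in the same style as the log excerpt."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:08X}"


def next_record_start(start_lsn: int, total_len: int) -> int:
    """Round the end of a record up to the next MAXALIGN boundary.

    Ignores WAL page headers, so it is valid only when the record
    stays inside one page.
    """
    end = start_lsn + total_len
    return (end + MAXALIGN - 1) & ~(MAXALIGN - 1)


# RUNNING_XACTS record from the excerpt: lsn 0/03CBCC58, total length 70.
running_xacts_start = parse_lsn("0/03CBCC58")
next_start = next_record_start(running_xacts_start, 70)
print(format_lsn(next_start))  # prints 0/03CBCCA0
```

The result equals both the confirmed_flush_lsn mentioned in step a. and the start of the COMMIT record for transaction 1369.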
Best regards,
Hayato Kuroda
FUJITSU LIMITED