On Wed, 23 Jul 2025 at 14:55, vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Tue, 22 Jul 2025 at 16:54, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> >
> > On Thu, 10 Jul 2025 at 12:33, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Tue, Apr 29, 2025 at 1:17 PM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> > > >
> > > > On Mon, 28 Apr 2025 at 10:28, vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > > >
> > > > > With this approach, there is a risk of starting from the next WAL
> > > > > record after the consistent point. For example, if the slot returns a
> > > > > consistent point at 0/1715E10, after the fix we would begin replaying
> > > > > from the next WAL record, such as 0/1715E40, which could potentially
> > > > > lead to data loss.
> > > > > As an alternative, we could set recovery_target_inclusive to false in
> > > > > the setup_recovery function. This way, recovery would stop just before
> > > > > the recovery target, allowing the publisher to start replicating
> > > > > exactly from the consistent point.
> > > > > Thoughts?
> > > >
> > > > This approach looks better to me.
> > > > I have prepared the patch for the same.
> > > >
> > >
> > > We should find out in which case and why the consistent_lsn is the
> > > start point LSN of a commit record. We use the slot's confirmed_flush
> > > LSN location as the consistent_lsn, which normally should be the end
> > > point of a running_xacts record or the commit-end LSN (in case the
> > > client sends an ack).
> > >
> > I checked it and here is my analysis:
> >
> > When we create a slot, it returns the confirmed_flush LSN as the
> > consistent_lsn. I noticed that, in general, when we create a slot the
> > confirmed_flush is set to the end of a RUNNING_XACTS record, or in
> > other words the start of the next record. And this next record can be
> > anything; it can be a COMMIT record for a transaction in another
> > session.
> > I have attached server logs and waldump logs for one such case,
> > reproduced using the test script shared in [1].
> > The snapbuild machinery has four states: SNAPBUILD_START,
> > SNAPBUILD_BUILDING_SNAPSHOT, SNAPBUILD_FULL_SNAPSHOT and
> > SNAPBUILD_CONSISTENT. Between each state a RUNNING_XACTS record is
> > logged.
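> >
> > For reference, these correspond to the SnapBuildState enum in
> > src/include/replication/snapbuild.h, roughly:
> >
> > typedef enum
> > {
> >     SNAPBUILD_START = -1,
> >     SNAPBUILD_BUILDING_SNAPSHOT = 0,
> >     SNAPBUILD_FULL_SNAPSHOT = 1,
> >     SNAPBUILD_CONSISTENT = 2,
> > } SnapBuildState;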
> >
> > When the command to create a slot is executed, for the first
> > RUNNING_XACTS record we get the following logs:
> > 2025-07-20 16:50:18.002 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] LOG: logical decoding found initial starting
> > point at 0/3C813D8
> > 2025-07-20 16:50:18.002 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] DETAIL: Waiting for transactions
> > (approximately 2) older than 1353 to end.
> > 2025-07-20 16:50:18.002 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] STATEMENT: SELECT lsn FROM
> > pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput',
> > false, false, false)
> >
> > waldump record corresponding to this:
> > rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn:
> > 0/03C813D8, prev 0/03C81390, desc: RUNNING_XACTS nextXid 1353
> > latestCompletedXid 1350 oldestRunningXid 1351; 2 xacts: 1352 1351
> >
> > For the second RUNNING_XACTS record, we get the following logs:
> > 2025-07-20 16:50:18.007 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] LOG: logical decoding found initial
> > consistent point at 0/3C97948
> > 2025-07-20 16:50:18.007 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] DETAIL: Waiting for transactions
> > (approximately 8) older than 1362 to end.
> > 2025-07-20 16:50:18.007 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] STATEMENT: SELECT lsn FROM
> > pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput',
> > false, false, false)
> >
> > waldump record:
> > rmgr: Standby len (rec/tot): 82/ 82, tx: 0, lsn:
> > 0/03C97948, prev 0/03C97900, desc: RUNNING_XACTS nextXid 1362
> > latestCompletedXid 1353 oldestRunningXid 1354; 8 xacts: 1357 1358
> > 1355 1354 1359 1361 1356 1360
> >
> > For the third RUNNING_XACTS record, we get the following logs:
> > 2025-07-20 16:50:18.038 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] LOG: logical decoding found consistent point
> > at 0/3CBCC58
> > 2025-07-20 16:50:18.038 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] DETAIL: There are no old transactions
> > anymore.
> > 2025-07-20 16:50:18.038 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] STATEMENT: SELECT lsn FROM
> > pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput',
> > false, false, false)
> >
> > waldump record:
> > rmgr: Standby len (rec/tot): 70/ 70, tx: 0, lsn:
> > 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid 1370
> > latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369
> > 1368 1367
> > The consistent point is found at "0/3CBCC58".
> >
> > When the slot is created, the confirmed_flush is set inside the
> > function DecodingContextFindStartpoint() using:
> > slot->data.confirmed_flush = ctx->reader->EndRecPtr;
> > In our case the value of the consistent_lsn is "0/03CBCCA0" (I added
> > some logging to capture the value). Logs:
> > 2025-07-20 16:50:18.039 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] LOG: #### confirmed_flush = 0/03CBCCA0
> > inside DecodingContextFindStartpoint
> > 2025-07-20 16:50:18.039 IST [1780326] port=5340
> > ubuntu(at)test_db/[unknown] STATEMENT: SELECT lsn FROM
> > pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput',
> > false, false, false)
> >
> > This consistent_lsn "0/03CBCCA0" is nothing but the end of the
> > RUNNING_XACTS record (whose start is "0/3CBCC58").
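> >
> > For reference, the relevant part of DecodingContextFindStartpoint()
> > in logical.c looks roughly like this (trimmed):
> >
> > /* Wait for a consistent starting point */
> > for (;;)
> > {
> >     record = XLogReadRecord(ctx->reader, &err);
> >     ...
> >     LogicalDecodingProcessRecord(ctx, ctx->reader);
> >
> >     /* only continue till we found a consistent spot */
> >     if (DecodingContextReady(ctx))
> >         break;
> > }
> >
> > slot->data.confirmed_flush = ctx->reader->EndRecPtr;
> >
> > The loop breaks right after the RUNNING_XACTS record that made the
> > snapshot consistent has been decoded, so EndRecPtr already points
> > past that record, i.e. at whatever record happens to come next.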
> >
> > While the slot is being created, a transaction in a concurrent
> > session commits (just after the third RUNNING_XACTS record) and adds
> > a COMMIT record:
> > rmgr: Transaction len (rec/tot): 46/ 46, tx: 1369, lsn:
> > 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-20 16:50:18.031146
> > IST
> >
> > So, in such cases the consistent LSN can be set to a COMMIT record.
> >
> > > If we decide to fix it in the way proposed here, then we also need to
> > > investigate whether we still need the additional WAL record added by
> > > commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc. The reason why that
> > > additional WAL record was added is discussed in email [1].
> > >
> > > [1] - https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed
> >
> > I reverted the change added by commit
> > 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc, applied my patch, and
> > checked the behaviour. I was able to reproduce the issue the commit
> > was resolving, so I think that change is still required.
> > It is still needed because, while recovery is performed in the
> > function PerformWalRecovery(), when recovery_target_inclusive is set
> > to false the function recoveryStopsBefore() is responsible for
> > deciding whether recovery is finished. It causes
> > 'reachedRecoveryTarget' to be set to true when the following
> > condition is satisfied:
> > /* Check if target LSN has been reached */
> > if (recoveryTarget == RECOVERY_TARGET_LSN &&
> > !recoveryTargetInclusive &&
> > record->ReadRecPtr >= recoveryTargetLSN)
> > Here we are checking if "start of the record" >= recoveryTargetLSN.
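> >
> > When this condition holds, the function records the stop point and
> > returns true, roughly:
> >
> >     recoveryStopAfter = false;
> >     recoveryStopLSN = record->ReadRecPtr;
> >     return true;
> >
> > and the caller in PerformWalRecovery() then sets
> > 'reachedRecoveryTarget' and stops replay before applying that record.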
> >
> > When a replication slot is created, a consistent_lsn is obtained.
> > Since this consistent_lsn points to the end of a record (in other
> > words, the start of the next record), there can be a case where no
> > WAL record exists yet at the consistent_lsn. So, during recovery, it
> > will wait until it reads a record at the consistent_lsn (during my
> > testing this wait was around ~20 sec), and this wait can cause the
> > timeout issue.
> > I have manually debugged and checked the above case and I think the
> > change in commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc is still
> > needed.
>
> I agree with your analysis.
>
> How about adding a comment along the following lines here:
>
> /*
>  * Set recovery_target_inclusive = false to avoid reapplying the
>  * transaction committed at consistent_lsn after the subscription is
>  * enabled.
>  */
>
>      appendPQExpBufferStr(recoveryconfcontents,
> -                         "recovery_target_inclusive = true\n");
> +                         "recovery_target_inclusive = false\n");
>      appendPQExpBufferStr(recoveryconfcontents,
>
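> With that change, the recovery settings pg_createsubscriber writes
> for the target server would end up looking roughly like:
>
> recovery_target_lsn = '<consistent_lsn>'
> recovery_target_inclusive = false
> recovery_target_action = promote
>
> so the standby stops just before the record at consistent_lsn and is
> then promoted.
>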
> Also, the master branch patch can be applied to the PG18 branch as
> well, so just the master patch and a PG17 branch patch are enough.
>
Thanks for reviewing the patch. I have added the comment and attached
the updated patches.
v6_HEAD applies to HEAD and REL_18_STABLE. v6_REL_17 applies to REL_17_STABLE.
Thanks,
Shlok Kyal