From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: Euler Taveira <euler(at)eulerto(dot)com>, duffieldzane(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
Date: 2025-05-10 19:19:38
Message-ID: CANhcyEUuF1VMLXbkwCVNc0ACAgrbkcADJTfEq9wqEWiLa6sfhQ@mail.gmail.com
Lists: pgsql-bugs
On Mon, 5 May 2025 at 20:17, vignesh C <vignesh21(at)gmail(dot)com> wrote:
>
> On Tue, 29 Apr 2025 at 13:17, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> >
> > On Mon, 28 Apr 2025 at 10:28, vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > >
> > > On Thu, 24 Apr 2025 at 11:57, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> > > >
> > > > On Thu, 24 Apr 2025 at 09:08, vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > > >
> > > > > On Wed, 23 Apr 2025 at 07:29, Euler Taveira <euler(at)eulerto(dot)com> wrote:
> > > > > >
> > > > > > On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
> > > > > >
> > > > > > I'm in the process of converting our databases from pglogical logical
> > > > > > replication to the native logical replication implementation on PostgreSQL
> > > > > > 17. One of the bugs we encountered and had to work around with pglogical was
> > > > > > the plugin dropping records while converting a streaming replica to
> > > > > > logical via pglogical_create_subscriber (reported at
> > > > > > https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to
> > > > > > confirm that the native logical replication implementation did not have this
> > > > > > problem, and I've found that it might have a different problem.
> > > > > >
> > > > > >
> > > > > > pg_createsubscriber uses a different approach than pglogical. While pglogical
> > > > > > uses a restore point, pg_createsubscriber uses the LSN from the latest
> > > > > > replication slot as a replication start point. The restore point approach is
> > > > > > usually suitable for physical replication but might not cover all scenarios for
> > > > > > logical replication (such as when there are in-progress transactions). Since
> > > > > > creating a logical replication slot does find a consistent decoding start
> > > > > > point, it is a natural choice to start the logical replication (that also needs
> > > > > > to find a decoding start point).
> > > > >
> > > > > I observed a difference between logical replication and
> > > > > pg_createsubscriber in how the replication origin is set, which can
> > > > > lead to different behaviors during replication restarts.
> > > > >
> > > > > Consider the following WAL records:
> > > > > rmgr: Transaction len (rec/tot): 46/ 46, tx: 767, lsn: 0/01756868, prev 0/01756780, desc: COMMIT 2025-04-23 15:49:24.349942 IST
> > > > > rmgr: Standby     len (rec/tot): 54/ 54, tx:   0, lsn: 0/01756898, prev 0/01756868, desc: RUNNING_XACTS nextXid 768 latestCompletedXid 767 oldestRunningXid 755; 1 xacts: 755
> > > > >
> > > > > Behavior in Logical Replication
> > > > > In logical replication, after applying the COMMIT record at LSN
> > > > > 0/01756868, the replication origin is set to the end LSN of the commit
> > > > > record, i.e., 0/01756898. On restart, replication resumes from the
> > > > > next LSN after the COMMIT record, which is 0/01756898 in this case.
> > > > > As a result, the same transaction is not sent again, and duplicate
> > > > > data is avoided.
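For reference, this can be observed on the subscriber via pg_replication_origin_status (a sketch; the origin name 'pg_16389' is hypothetical, since subscription origins are named pg_<suboid>):

SELECT external_id, remote_lsn
FROM pg_replication_origin_status
WHERE external_id = 'pg_16389';
-- After the COMMIT above is applied, remote_lsn shows 0/01756898,
-- the end LSN of the commit record.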
> > > > >
> > > > > Behavior in pg_createsubscriber
> > > > > However, in the case of pg_createsubscriber, the consistent point used
> > > > > for creating the replication slot on the publisher may be set exactly
> > > > > at the commit LSN (0/01756868, xid 767). When promoting the standby,
> > > > > this same LSN is used as recovery_target_lsn, so the standby recovers
> > > > > up to and including the commit of transaction 767.
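In other words, the standby is promoted with recovery settings roughly like the following (a sketch of what setup_recovery() writes, using the LSN from the example above; the actual output has more settings):

recovery_target_lsn = '0/01756868'   # the consistent point from slot creation
recovery_target_inclusive = true     # the commit record itself is replayed
recovery_target_action = 'promote'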
> > > > >
> > > > > After promotion, if the replication origin is also set to this same
> > > > > commit LSN, the subscriber will request changes starting from that
> > > > > point. Since the origin does not reflect that the commit has already
> > > > > been applied, transaction 767 gets replicated again, leading to
> > > > > duplicate data and possible replication failure.
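Setting the origin this way is roughly equivalent to the following SQL (a sketch; the origin name is hypothetical):

SELECT pg_replication_origin_advance('pg_16389', '0/01756868');
-- 0/01756868 is the *start* of the commit record, so on restart the
-- walsender sends transaction 767 again.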
> > > > >
> > > > > If the issue is not reproducible using Zane's suggested steps, we can
> > > > > try the following:
> > > > > 1) Stop the standby server.
> > > > > 2) Perform an insert transaction and note the commit LSN using pg_waldump.
> > > > > 3) Set up the publisher, replication slot, etc., simulating the
> > > > > pg_createsubscriber behavior.
> > > > > 4) Restart the standby with promotion configuration from
> > > > > setup_recovery(), setting recovery_target_lsn to the commit LSN from
> > > > > step 2.
> > > > > 5) Create the subscription and set the replication origin to the same
> > > > > commit LSN.
> > > > > 6) Enable the subscription.
> > > > >
> > > > > This setup should reproduce the issue where the transaction gets
> > > > > applied twice on the subscriber due to the replication origin being
> > > > > aligned with the commit LSN rather than its end.
> > > > >
> > > > > Thoughts?
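A rough SQL sketch of steps 5 and 6 (subscription, publication, slot, and connection details are hypothetical):

-- Step 5: create the subscription disabled, reusing the slot from step 3,
-- then pin its origin to the commit LSN noted in step 2.
CREATE SUBSCRIPTION sub CONNECTION 'host=pub_host dbname=postgres'
PUBLICATION pub
WITH (create_slot = false, enabled = false, copy_data = false, slot_name = 'sub_slot');
SELECT pg_replication_origin_advance('pg_' || oid, '0/01756868')
FROM pg_subscription WHERE subname = 'sub';
-- Step 6: enable the subscription; transaction 767 should then be
-- applied a second time, reproducing the duplicate.
ALTER SUBSCRIPTION sub ENABLE;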
> > > >
> > > > Hi Vignesh,
> > > >
> > > > I have verified the behaviour you described for logical replication
> > > > and pg_createsubscriber and agree with your analysis.
> > > > I have also tried the steps you shared and am able to reproduce the issue.
> > > >
> > > > I am thinking of resolving it by introducing a new API which, given an
> > > > LSN, returns the next LSN.
> > > > Currently, during the pg_createsubscriber run we advance the replication
> > > > origin to 'consistent_lsn'. I think we should advance it to the LSN
> > > > immediately after 'consistent_lsn' instead.
> > > > I think this will resolve the issue. Thoughts?
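For reference, the next LSN is simply the end_lsn of the record at 'consistent_lsn', which pg_walinspect can already report (a sketch using the commit LSN from the example above):

SELECT end_lsn FROM pg_get_wal_record_info('0/01756868');
-- Returns 0/01756898 for the COMMIT record shown earlier.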
> > >
> > > With this approach, there is a risk of starting from the next WAL
> > > record after the consistent point. For example, if the slot returns a
> > > consistent point at 0/1715E10, after the fix we would begin replaying
> > > from the next WAL record, such as 0/1715E40, which could potentially
> > > lead to data loss.
> > > As an alternative, we could set recovery_target_inclusive to false in
> > > the setup_recovery function. This way, recovery would stop just before
> > > the recovery target, allowing the publisher to start replicating
> > > exactly from the consistent point.
> > > Thoughts?
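With that change, the recovery settings from the earlier sketch would become roughly:

recovery_target_lsn = '0/01756868'   # unchanged: the consistent point
recovery_target_inclusive = false    # stop just before the target record
recovery_target_action = 'promote'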
> >
> > This approach looks better to me.
> > I have prepared the patch for the same.
> >
> > With the patch applied, I ran the test script in [1]. It ran for ~7 hours
> > and did not reproduce the issue.
>
> Can you check and verify the original scenario, i.e., that the
> consistent LSN's record type is COMMIT. You can use the attached patch,
> which prints the consistent LSN's record type. Make sure to create the
> pg_walinspect extension before running pg_createsubscriber, as
> pg_walinspect's pg_get_wal_record_info function is used.
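That is, something along these lines (a sketch; the exact query in the debug patch may differ):

CREATE EXTENSION pg_walinspect;
SELECT resource_manager, record_type
FROM pg_get_wal_record_info('0/01756868');  -- the consistent LSN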
>
Hi Vignesh,
Thanks for providing the patch. I have verified the original case with
the patch on both HEAD and PG_17.
With the changes in [1], I noticed that even when the consistent LSN
points at a 'COMMIT' record, we do not get data mismatches or primary
key conflicts.
I have attached the logs for the test runs with your changes, along
with the patches that apply on HEAD and PG_17.
Thanks and Regards,
Shlok Kyal
Attachment | Content-Type | Size |
---|---|---|
test.log | application/octet-stream | 313.6 KB |
v3_HEAD-0001-Fix-duplicate-insert-during-pg_createsubscri.patch | application/octet-stream | 2.1 KB |
v3_REL_17-0001-Fix-duplicate-insert-during-pg_createsubsc.patch | application/octet-stream | 2.1 KB |