From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | RE: Replication slot is not able to sync up |
Date: | 2025-06-16 03:54:18 |
Message-ID: | OS0PR01MB5716F14E904A5CB06053AD6D9470A@OS0PR01MB5716.jpnprd01.prod.outlook.com |
Lists: | pgsql-hackers |
On Sat, Jun 14, 2025 at 11:37 PM Dilip Kumar wrote:
>
> On Fri, May 30, 2025 at 3:38 PM Zhijie Hou (Fujitsu) <houzj(dot)fnst(at)fujitsu(dot)com>
> wrote:
> >
> > On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
> > >
> > > > On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > In the case presented here, the logical slot is expected to keep
> > > > forwarding, and in the consecutive sync cycle, the sync should be
> > > > successful. Users using logical decoding APIs should also be aware
> > > > > that if, for some reason, the logical slot is not moving
> > > > forward, the master/publisher node will start accumulating dead
> > > > rows and WAL, which can create bigger problems.
> > >
> > > > I've tried this case and am concerned that the slot synchronization
> > > > using pg_sync_replication_slots() would never succeed while the primary
> > > keeps getting write transactions. Even if the user manually consumes
> > > changes on the primary, the primary server keeps advancing its XID
> > > in the meanwhile. On the standby, we ensure that the
> > > TransamVariables->nextXid is beyond the XID of WAL record that it's
> > > going to apply so the xmin horizon calculated by
> > > GetOldestSafeDecodingTransactionId() ends up always being higher
> > > than the slot's catalog_xmin on the primary. We get the log message
> > > "could not synchronize replication slot "s" because remote slot
> > > > precedes local slot" and clean up the slot on the standby at the end of
> > > > pg_sync_replication_slots().
> >
> > To improve this workload scenario, we can modify
> > pg_sync_replication_slots() to wait for the primary slot to advance to
> > a suitable position before completing synchronization and removing the
> > temporary slot. This would allow the sync to complete as soon as the
> > primary slot advances, whether through
> > pg_logical_xx_get_changes() or other ways.
> >
> > I've created a POC (attached) that currently waits indefinitely for
> > the remote slot to catch up. We could later add a timeout parameter to
> > control maximum wait time if this approach seems acceptable.
> >
> > I tested that, when pgbench TPC-B is running on the primary, calling
> > pg_sync_replication_slots() on the standby correctly blocks until I
> > advance the primary slot position by calling pg_logical_xx_get_changes().
> >
> > If the basic idea sounds reasonable, then I can start a separate thread
> > to extend this API. Thoughts?
>
> IMHO, this idea has merit. Have you started a thread for reviewing this patch?

Thank you for looking at it. I plan to start a new thread soon for the
upcoming commit fest, after some additional testing and documentation cleanup.
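
For readers following along, the waiting behaviour described in the quoted
POC summary can be sketched roughly as below. This is a standalone
illustration in Python, not the attached patch and not PostgreSQL code. The
names SlotPosition, remote_precedes_local, wait_for_remote_slot and
fetch_remote are hypothetical, the LSN and XID are modelled as plain integers
(XID wraparound is ignored), and the timeout argument corresponds to the
optional parameter mentioned above as a possible later addition.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SlotPosition:
    """Simplified slot state (hypothetical model, not a PostgreSQL struct)."""
    restart_lsn: int   # WAL position, modelled as a plain integer
    catalog_xmin: int  # oldest catalog XID; wraparound is ignored here


def remote_precedes_local(remote: SlotPosition, local: SlotPosition) -> bool:
    """True while the primary's slot is still behind the standby's local slot."""
    return (remote.restart_lsn < local.restart_lsn
            or remote.catalog_xmin < local.catalog_xmin)


def wait_for_remote_slot(
    fetch_remote: Callable[[], SlotPosition],
    local: SlotPosition,
    timeout: Optional[float] = None,
    poll_interval: float = 0.01,
) -> bool:
    """Poll the remote slot until it has caught up with the local one.

    timeout=None waits indefinitely, as the POC is described as doing.
    Returns True once the remote slot no longer precedes the local slot,
    or False if the timeout expires first.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        if not remote_precedes_local(fetch_remote(), local):
            return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

In this model, fetch_remote stands in for whatever query the standby would
issue to read the primary's slot position. The loop returns as soon as the
primary slot has advanced far enough, regardless of how it was advanced.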
Best Regards,
Hou zj