Re: Replication slot is not able to sync up

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	"Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Replication slot is not able to sync up
Date:	2025-05-28 06:25:18
Message-ID:	CAD21AoChZmhH70vikmiXH+MXt173PcCvioxtHA_MD1A_Apaq_Q@mail.gmail.com
Lists:	pgsql-hackers

On Tue, May 27, 2025 at 9:15 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Wed, May 28, 2025 at 2:09 AM Masahiko Sawada wrote:
> >
> > On Fri, May 23, 2025 at 10:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> > >
> > > In the case presented here, the logical slot is expected to keep
> > > forwarding, and in the consecutive sync cycle, the sync should be
> > > successful. Users using logical decoding APIs should also be aware
> > > > that if, for some reason, the logical slot is not moving forward,
> > > the master/publisher node will start accumulating dead rows and WAL,
> > > which can create bigger problems.
> >
> > I've tried this case and am concerned that the slot synchronization using
> > pg_sync_replication_slots() would never succeed while the primary keeps
> > getting write transactions. Even if the user manually consumes changes on the
> > primary, the primary server keeps advancing its XID in the meanwhile. On the
> > standby, we ensure that the
> > TransamVariables->nextXid is beyond the XID of WAL record that it's
> > going to apply so the xmin horizon calculated by
> > GetOldestSafeDecodingTransactionId() ends up always being higher than the
> > slot's catalog_xmin on the primary. We get the log message "could not
> > synchronize replication slot "s" because remote slot precedes local slot" and
> > clean up the slot on the standby at the end of pg_sync_replication_slots().
>
> I think the issue occurs because unlike the slotsync worker, the SQL API
> removes temporary slots when the function ends, so it cannot hold back the
> standby's catalog_xmin. If transactions on the primary keep advancing xids, the
> source slot's catalog_xmin on the primary fails to catch up with the standby's
> nextXid, causing sync failure.

Agreed with this analysis.
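
To illustrate the analysis quoted above, here is a minimal toy model in Python. It is not PostgreSQL code: the function name, the XID numbers, and the fixed consumer lag are all made up for illustration, and it compares only catalog_xmin (it ignores restart_lsn and other running transactions). It only sketches why dropping the temporary slot after every call can keep the initial sync from ever succeeding, while keeping the slot lets the remote slot catch up.

```python
def simulate_initial_sync(keep_slot, cycles=5, xids_per_cycle=100, consumer_lag=10):
    """Toy model of the initial slot sync on a standby (all values hypothetical).

    keep_slot=True  mimics the slotsync worker, which keeps the temporary slot
                    (and therefore its catalog_xmin) between sync cycles.
    keep_slot=False mimics the SQL API, which drops the temporary slot when the
                    function returns, so the next call starts from scratch.

    Returns the cycle in which the sync succeeds, or None if it never does.
    """
    primary_next_xid = 1000   # arbitrary starting point
    local_catalog_xmin = None  # no slot on the standby yet

    for cycle in range(1, cycles + 1):
        # The primary keeps running write transactions between sync attempts.
        primary_next_xid += xids_per_cycle

        # The remote slot is being consumed, but it always trails nextXid a bit.
        remote_catalog_xmin = primary_next_xid - consumer_lag

        # Assume the standby has replayed everything, so its nextXid matches.
        standby_next_xid = primary_next_xid

        # A newly created local slot gets its xmin from the standby's horizon.
        if local_catalog_xmin is None:
            local_catalog_xmin = standby_next_xid

        # The sync succeeds only if the remote slot does not precede the local one.
        if remote_catalog_xmin >= local_catalog_xmin:
            return cycle

        # SQL API behaviour in this model: the temporary slot is cleaned up.
        if not keep_slot:
            local_catalog_xmin = None

    return None


if __name__ == "__main__":
    print("slot kept between cycles:", simulate_initial_sync(keep_slot=True))
    print("slot dropped after each call:", simulate_initial_sync(keep_slot=False))
```

In this model the retained slot pins the local catalog_xmin at the value from the first attempt, so the remote slot passes it on a later cycle. When the slot is dropped each time, the local value is recomputed from the standby's current nextXid, which stays ahead of the remote slot for as long as the primary keeps consuming XIDs.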

> This only affects the initial sync when creating a new slot on the standby.
> Once the slot exists, the standby's catalog_xmin stabilizes, preventing the
> issue in subsequent syncs.

Right. I think this is an area where we can improve, if there is a
real use case.

> I think the SQL API was mainly intended for testing and debugging purposes
> where controlled sync operations are useful. For production use, the slotsync
> worker (with sync_replication_slots=on) is recommended because it automatically
> handles this problem and requires minimal manual intervention. But to avoid
> confusion, I think we should clearly document this distinction.

I didn't know it was intended for testing and debugging purposes, so
clarifying that in the documentation would be a good idea. Also, I
agree that the slotsync worker is the primary way to use this
feature. I'm interested in whether there is a use case where the SQL
API is preferable. If there is, we can improve the SQL API,
especially the initial synchronization, for v19 or later.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

RE: Replication slot is not able to sync up at 2025-05-28 04:15:49 from Zhijie Hou (Fujitsu)

Responses

Re: Replication slot is not able to sync up at 2025-05-29 03:09:25 from shveta malik
