Re: Replication slot is not able to sync up

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Replication slot is not able to sync up
Date: 2025-05-24 05:07:09
Message-ID: CAA4eK1Kcr8MCOEjVjp=bw6EaihAgSeDGjnftGXrYe6GXEw7NPg@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 23, 2025 at 11:25 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, May 23, 2025 at 12:55 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > The remote_slot (slot on primary) should be advanced before you invoke sync_slot. Can you do pg_logical_slot_get_changes() API before performing sync? You can check the xmin of the logical slot after get_changes to ensure that xmin has moved to 765 in your case.
>
> I'm fairly dismayed by this example. I hope I'm misunderstanding
> something, because otherwise I have difficulty understanding how we
> thought it was OK to ship this feature in this condition.
>
> At the moment that pg_sync_replication_slots() is executed, a slot
> named failover_slot exists on only one of the two servers. How can you
> justify emitting an error message complaining that "remote slot
> precedes local slot"? There's only one slot! I understand that, under
> the hood, we probably created an additional slot on the standby and
> then tried to fast-forward it, and this error occurred in the second
> step. But a user shouldn't have to understand those kinds of internal
> implementation details to make sense of the error message.
>

Fair point.

> If the
> problem is that we're not able to create a slot on the standby at an
> old enough LSN or XID position to permit its use with the
> corresponding slot on the master, it should be reported that way.
>

That is the case, and we should improve the LOG message. However, let
me first explain what is going on here. This happens because the DDL
is replicated before the pg_sync_replication_slots() call, so the
locally created slot on the standby acquires a later xmin (765) than
the slot on the master (764). We can't sync in that particular sync
cycle because, if we did, we couldn't guarantee that the rows the slot
requires would still be present on the standby when one later tries to
use the slot.
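
To make the comparison concrete, here is a minimal sketch in Python of the kind of check described above. It is a simplified model, not the server's actual code: the SlotPosition type, its field names, and the function name are illustrative assumptions, the LSN value is a made-up placeholder, and transaction ID wraparound is ignored. Only the xmin values 764 and 765 come from this thread.

```python
from dataclasses import dataclass


@dataclass
class SlotPosition:
    """Position of a logical slot (hypothetical type for illustration)."""
    restart_lsn: int   # oldest WAL position the slot still needs
    catalog_xmin: int  # oldest transaction whose catalog rows the slot needs


def remote_precedes_local(remote: SlotPosition, local: SlotPosition) -> bool:
    """Return True when the slot on the master needs older data than the
    standby's local slot retains, in which case the sync is skipped.
    Plain integer comparison; real transaction IDs wrap around."""
    return (remote.restart_lsn < local.restart_lsn
            or remote.catalog_xmin < local.catalog_xmin)


# Standby's freshly created slot: xmin 765 (LSN 1000 is a placeholder).
local = SlotPosition(restart_lsn=1000, catalog_xmin=765)

# Master's slot has not advanced yet: xmin 764, so this cycle is skipped.
remote = SlotPosition(restart_lsn=1000, catalog_xmin=764)
print(remote_precedes_local(remote, local))  # True

# After the slot on the master advances to xmin 765, a later cycle can sync.
remote.catalog_xmin = 765
print(remote_precedes_local(remote, local))  # False
```

The second call models a later sync cycle, after the slot on the master has been advanced (for example by consuming changes from it).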

IIUC, users will use this feature where the master (publisher) and
subscriber nodes are doing logical replication, and we want to keep a
copy of the corresponding logical slot on the physical standby, so
that if the master goes down, the subscriber can continue logical
replication from the physical standby. In such a setup, users won't
need to bother with such LOGs because even if we are not able to sync
the logical slot in a particular sync cycle and the LOG appears, we
should be able to sync in the next cycle.

In the case presented here, the logical slot is expected to keep
moving forward, and in a subsequent sync cycle, the sync should be
successful. Users using logical decoding APIs should also be aware
that if, for some reason, the logical slot is not moving forward,
the master/publisher node will start accumulating dead rows and WAL,
which can create bigger problems.

--
With Regards,
Amit Kapila.
