Orphaned records in pg_replication_origin_status after subscription drop

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Orphaned records in pg_replication_origin_status after subscription drop
Date: 2025-12-19 05:11:45
Message-ID: aUTekQTg4OYnw-Co@paquier.xyz
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,
(Added Amit K. in CC.)

Some colleagues have reported that it is possible to finish with
orphaned records in pg_replication_origin_status, as an effect of
table synchronization workers that miss some cleanup actions around
replorigin_drop_by_name() when a subscription that spawned these
workers (which themselves create origins through replorigin_create()
called in LogicalRepSyncTableStart()) is dropped.

Please find attached a small script, courtesy of Tenglong Gu and
Daisuke Higuchi, that have been digging into all that.

This problem is true since ce0fdbfe9722 that has added the creation of
a replication origin for table synchronization workers, as of 14 up to
HEAD. One issue with these origin states is that they are sticky: a
restart of the node that held the subscription keeps them around due
to them getting flushed in replorigin_checkpoint. If my understanding
is right, this also means that origin slots are still tracked as in
use, implying that replorigin_session_setup() would not be able to
reuse them as far as I understand.

So, in summary (please correct me if necessary), DropSubscription()
attempts a round of cleanup for all the replication origins with a
loop over ReplicationOriginNameForLogicalRep(), and while in this case
we attempt to drop the origins created by the workers that are tracked
as orphaned, the syscache lookup of REPLORIGIDENT fails within the
DROP SUBSCRIPTION because replorigin_by_name() cannot find an entry:
the replication origin is already gone, but its state has persisted in
memory. Then, why would the replication origin be already gone with
its state in memory not cleaned up yet? Well, the problematic test
case shows that the subscription is dropped while the spawned
tablesync workers do the initial table copy, and the replication
origin is created in the same transaction as the COPY. DROP
SUBSCRIPTION tells the tablesync workers to stop, they stop with the
COPY failing and the origin is not seen as something that exists for
the session that drops the subscription. The replication state in
memory goes out of sync.

So it looks like to me that we are missing a code path where the
replication origin is dropped but we just ignore to reset the
replication state in memory, leading to bloat of the origin slots
available. worker.c does things differently: a small transaction is
used for the origin creation, hence we would never really see the
replication state memory going out-of-sync with the replorigin
catalog.

One solution would be for tablesync.c to do the same as worker.c: we
could use a short transaction that sets the memory state and creates
the replication origin, then move to a second transaction for the
COPY. A second solution, slightly more complicated, is to create some
specific logic to reset the progress state of the origin that's been
created in the transaction that runs the initial COPY, which I guess
should be in the shape of a transaction abort callback. All these
symptoms are pointing out that it is not a good idea, IMO, to expect a
replication origin to exist when dropping a subscription when an
origin has been created in the context of a transaction that can take
a very long time to run, aka the initial tablesync's COPY, so using a
short transaction feels like a good thing to do here.

I have to admit that this code has become much more complicated over
the last couple of years, hence I may not have the full context of
how these cleanups actions should be achieved. They cleary should
happen though, but I am not completely sure which solution would be
better over the other. There may be an entirely different solution,
as well.

Note that it would not be complicated to create a regression test
based on an injection point: waiting during the COPY of a tablesync
worker would be enough while issuing a DROP SUBSCRIPTION.

So, thoughts or comments?
--
Michael

Attachment Content-Type Size
subpub_copy.txt text/plain 2.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2025-12-19 05:13:43 Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Previous Message shveta malik 2025-12-19 05:10:28 Re: Proposal: Conflict log history table for Logical Replication