| From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
|---|---|
| To: | Michael Paquier <michael(at)paquier(dot)xyz> |
| Cc: | Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Orphaned records in pg_replication_origin_status after subscription drop |
| Date: | 2025-12-19 10:18:21 |
| Message-ID: | CAA4eK1JNv8PWR-5A5xqeCebwoh4e6juPmiSAC1AFiGGx+Fdtjg@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, Dec 19, 2025 at 10:42 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> Some colleagues have reported that it is possible to end up with
> orphaned records in pg_replication_origin_status, caused by table
> synchronization workers missing some cleanup actions around
> replorigin_drop_by_name() when the subscription that spawned them is
> dropped (these workers create origins through replorigin_create(),
> called in LogicalRepSyncTableStart()).
>
> Please find attached a small script, courtesy of Tenglong Gu and
> Daisuke Higuchi, who have been digging into all that.
>
> This problem has existed since ce0fdbfe9722, which added the creation
> of a replication origin for table synchronization workers, and affects
> 14 up to HEAD. One issue with these origin states is that they are
> sticky: a restart of the node that held the subscription keeps them
> around, since they get flushed in replorigin_checkpoint. If my
> understanding is right, this also means that the origin slots are
> still tracked as in use, so replorigin_session_setup() would not be
> able to reuse them.
>
> So, in summary (please correct me if necessary), DropSubscription()
> attempts a round of cleanup for all the replication origins with a
> loop over ReplicationOriginNameForLogicalRep(), and while in this case
> we attempt to drop the origins created by the workers that are tracked
> as orphaned, the syscache lookup of REPLORIGIDENT fails within the
> DROP SUBSCRIPTION because replorigin_by_name() cannot find an entry:
> the replication origin is already gone, but its state has persisted in
> memory. Why, then, is the replication origin already gone while its
> in-memory state has not yet been cleaned up? Well, the problematic test
> case shows that the subscription is dropped while the spawned
> tablesync workers do the initial table copy, and the replication
> origin is created in the same transaction as the COPY. DROP
> SUBSCRIPTION tells the tablesync workers to stop, they stop with the
> COPY failing, and the origin is not visible to the session that drops
> the subscription. The replication state in memory goes out of sync.
>
> So it looks to me like we are missing a code path where the
> replication origin is dropped but we fail to reset the replication
> state in memory, leading to bloat of the available origin slots.
> worker.c does things differently: a small transaction is used for the
> origin creation, so we would never see the replication state in
> memory going out of sync with the replorigin catalog.
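For anyone following along, the worker.c-style pattern being referenced is roughly the following sketch (PostgreSQL internals, not compilable on its own; note that replorigin_session_setup() grew an extra argument in later branches, so the exact signatures vary):

```c
/* Sketch: create and set up the origin in its own short transaction ... */
StartTransactionCommand();
originid = replorigin_create(originname);
replorigin_session_setup(originid);
CommitTransactionCommand();

/*
 * ... then run the long initial COPY in a separate transaction, so an
 * abort there cannot leave the in-memory origin state out of sync with
 * the replorigin catalog.
 */
StartTransactionCommand();
copy_table(rel);
CommitTransactionCommand();
```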
>
> One solution would be for tablesync.c to do the same as worker.c: we
> could use a short transaction that sets the memory state and creates
> the replication origin, then move to a second transaction for the
> COPY. A second solution, slightly more complicated, is to create some
> specific logic to reset the progress state of the origin that's been
> created in the transaction that runs the initial COPY, which I guess
> should take the shape of a transaction abort callback. All these
> symptoms indicate, IMO, that it is not a good idea to expect a
> replication origin to exist when dropping a subscription if that
> origin has been created in the context of a transaction that can take
> a very long time to run, i.e. the initial tablesync's COPY, so using a
> short transaction feels like a good thing to do here.
>
Yeah, either of these ways is okay. The only minor point against
creating the replication origin in a separate transaction is that the
other operations on the origin still need to be done at the current
place, so the origin handling would look a bit scattered. So, I have a
slight preference for the PG_ENSURE_ERROR_CLEANUP solution, where the
callback should clear the origin state via replorigin_state_clear.
--
With Regards,
Amit Kapila.