| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Orphaned records in pg_replication_origin_status after subscription drop |
| Date: | 2025-12-19 21:56:38 |
| Message-ID: | CAD21AoA=28eFNFY5mYRBhiSj=Q4QJwM+3_K8_potXdAjXspYGA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Fri, Dec 19, 2025 at 2:18 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Dec 19, 2025 at 10:42 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > Some colleagues have reported that it is possible to finish with
> > orphaned records in pg_replication_origin_status, as an effect of
> > table synchronization workers that miss some cleanup actions around
> > replorigin_drop_by_name() when a subscription that spawned these
> > workers (which themselves create origins through replorigin_create()
> > called in LogicalRepSyncTableStart()) is dropped.
> >
> > Please find attached a small script, courtesy of Tenglong Gu and
> > Daisuke Higuchi, who have been digging into all that.
> >
> > This problem has existed since ce0fdbfe9722, which added the creation
> > of a replication origin for table synchronization workers, from 14 up
> > to HEAD. One issue with these origin states is that they are sticky: a
> > restart of the node that held the subscription keeps them around
> > because they get flushed in replorigin_checkpoint. If my understanding
> > is right, this also means that the origin slots are still tracked as
> > in use, implying that replorigin_session_setup() would not be able to
> > reuse them.
> >
> > So, in summary (please correct me if necessary), DropSubscription()
> > attempts a round of cleanup for all the replication origins with a
> > loop over ReplicationOriginNameForLogicalRep(), and while in this case
> > we attempt to drop the origins created by the workers that are tracked
> > as orphaned, the syscache lookup of REPLORIGIDENT fails within the
> > DROP SUBSCRIPTION because replorigin_by_name() cannot find an entry:
> > the replication origin is already gone, but its state has persisted in
> > memory. Then, why would the replication origin be already gone with
> > its state in memory not cleaned up yet? Well, the problematic test
> > case shows that the subscription is dropped while the spawned
> > tablesync workers do the initial table copy, and the replication
> > origin is created in the same transaction as the COPY. DROP
> > SUBSCRIPTION tells the tablesync workers to stop; they stop with the
> > COPY failing, and the origin is not seen as something that exists for
> > the session that drops the subscription. The replication state in
> > memory goes out of sync.
> >
> > So it looks to me like we are missing a code path where the
> > replication origin is dropped but we fail to reset the replication
> > state in memory, leading to bloat of the available origin slots.
> > worker.c does things differently: a small transaction is used for the
> > origin creation, hence we would never really see the replication state
> > in memory going out of sync with the replorigin catalog.
IIUC a similar issue also happens when the tablesync worker simply
errors out for some reason (e.g., conflicts), though in that case the
tablesync worker retries from scratch. While the replication origin
record is deleted during the rollback, the origin slot remains marked
in-use even though no one has acquired it. After the tablesync worker
restarts, it gets a new origin ID and might reuse the orphaned slot if
the origin ID happens to match.
> >
> > One solution would be for tablesync.c to do the same as worker.c: we
> > could use a short transaction that sets the memory state and creates
> > the replication origin, then move to a second transaction for the
> > COPY.
Also, for such error cases the tablesync worker should accept that the
replication origin already exists, instead of raising an error:
    ereport(ERROR,
            (errcode(ERRCODE_DUPLICATE_OBJECT),
             errmsg("replication origin \"%s\" already exists",
                    originname)));
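A minimal sketch of what I have in mind, assuming we are still inside a
transaction and originname was built with
ReplicationOriginNameForLogicalRep() (untested):

    RepOriginId originid;

    /*
     * Tolerate an origin left behind by a previous failed attempt of this
     * tablesync worker instead of failing with ERRCODE_DUPLICATE_OBJECT.
     */
    originid = replorigin_by_name(originname, true);
    if (originid == InvalidRepOriginId)
        originid = replorigin_create(originname);

Whether it is actually safe to reuse the origin this way in the retry
case still needs verification, as I mention at the end of this email.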
> > A second solution, slightly more complicated, is to create some
> > specific logic to reset the progress state of the origin that's been
> > created in the transaction that runs the initial COPY, which I guess
> > should be in the shape of a transaction abort callback.
I find it too much of a change for backpatching.
> So, I have a slight
> preference towards PG_ENSURE_ERROR_CLEANUP solution where the callback
> should clear origin state via replorigin_state_clear.
That would work too. Or I think we could do a similar thing in
replorigin_reset() for tablesync workers that are in
SUBREL_STATE_DATASYNC state. Both ways require exposing
replorigin_state_clear(), though.
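For reference, a rough sketch of what I imagine for the
PG_ENSURE_ERROR_CLEANUP route, where tablesync_origin_cleanup() is a
made-up callback name and the call into the exposed
replorigin_state_clear() is only indicated by a comment, since its
public signature would be decided by the patch:

    /* hypothetical cleanup callback in tablesync.c */
    static void
    tablesync_origin_cleanup(int code, Datum arg)
    {
        /*
         * Release the in-memory ReplicationState slot set up for the
         * origin whose catalog row is being rolled back; this is where
         * the exposed replorigin_state_clear() (or a thin wrapper around
         * it) would be called.
         */

        /* Forget the session origin as well. */
        replorigin_session_origin = InvalidRepOriginId;
    }

    /* around the transaction that creates the origin and runs the COPY */
    PG_ENSURE_ERROR_CLEANUP(tablesync_origin_cleanup, (Datum) 0);
    {
        copy_table(rel);
    }
    PG_END_ENSURE_ERROR_CLEANUP(tablesync_origin_cleanup, (Datum) 0);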
I am slightly leaning towards the idea of using a short transaction,
because the tablesync worker would then do things more like the apply
worker, and it would also fix the odd behavior that
pg_replication_origin_status shows NULL in the external_id column for
such origins while the COPY is being executed. But we need to verify
whether it's really okay to reuse the existing origin instead of
raising an error in the case where the tablesync worker retries the
data copy.
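To be a bit more concrete, the flow I'm thinking of for
LogicalRepSyncTableStart() would look roughly like the following
(untested; the walrcv connection, error handling, and the
existing-origin case from above are omitted, rel stands for the
already-opened target relation, and replorigin_session_setup() takes
only one argument on the older back branches):

    char        originname[NAMEDATALEN];
    RepOriginId originid;

    /* Create and set up the origin in its own small transaction. */
    StartTransactionCommand();
    ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
                                       MyLogicalRepWorker->relid,
                                       originname, sizeof(originname));
    originid = replorigin_create(originname);
    replorigin_session_setup(originid, 0);
    replorigin_session_origin = originid;
    CommitTransactionCommand();

    /* Then a second transaction for the initial data copy. */
    StartTransactionCommand();
    copy_table(rel);
    CommitTransactionCommand();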
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com