Archive-fed logical decoding: pausing recovery on slot conflict

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: pgsql-hackers mailing list <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Kirk Wolak <wolakk(at)gmail(dot)com>, nik(at)postgres(dot)ai
Subject: Archive-fed logical decoding: pausing recovery on slot conflict
Date: 2026-06-05 07:06:48
Message-ID: 967DBEA9-E8B7-4705-AD36-447D839AACA9@yandex-team.ru
Lists: pgsql-hackers

Hi hackers!

We would like to get the community's opinion on an architecture for
running logical decoding on a standby that is fed only by the WAL
archive (restore_command), with no streaming link to the primary. The
goal is continuous CDC into analytics systems straight from an archive,
without adding load or a feedback dependency on the primary.

Three AI-written commits are attached. They are meant to make the idea
concrete and testable for this discussion, not to be a finished patch.
We went through several rounds of editing to sharpen both the idea and
the implementation.

The problem
-----------

Logical decoding on a standby (PG16+) keeps the catalog readable for
decoding by holding back catalog_xmin, and it relies on the primary
holding its own catalog_xmin back via hot_standby_feedback over a
streaming connection. An archive-fed standby has no walreceiver and
therefore no feedback channel: the primary vacuums catalog tuples
freely, ships the resulting WAL to the archive, and when the standby
replays those records they conflict with the local logical slot's
catalog_xmin. The slot is invalidated, in practice roughly
2 * autovacuum_naptime after it is created.

As far as we can tell, this catalog conflict is the only fundamental
blocker for archive-fed decoding. All such conflicts funnel through a
single choke point, ResolveRecoveryConflictWithSnapshot(), which
invalidates logical slots only for catalog relations:

    if (IsLogicalDecodingEnabled() && isCatalogRel)
        InvalidateObsoleteReplicationSlots(RS_INVAL_HORIZON, ...);

The records that reach it with a catalog conflict horizon are:

* Heap2 PRUNE records on catalog relations
(PRUNE_ON_ACCESS, PRUNE_VACUUM_SCAN, PRUNE_VACUUM_CLEANUP;
flag XLHP_IS_CATALOG_REL)
* B-tree delete and page-reuse records on catalog indexes
(isCatalogRel)

System catalogs are indexed only with B-tree, so in practice the index
side is always B-tree; the other AMs' vacuum records route through the
same choke point but never carry a catalog horizon here.
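
To make the conflict condition concrete, here is a minimal standalone
model of the check. It is not PostgreSQL source: Xid,
xid_precedes_or_equals() and slot_conflicts_with_horizon() are
hypothetical names, and the "catalog_xmin precedes or equals the
horizon" rule is our reading of the RS_INVAL_HORIZON case, stated here
as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's TransactionId and its special values. */
typedef uint32_t Xid;
#define INVALID_XID      ((Xid) 0)
#define FIRST_NORMAL_XID ((Xid) 3)

/* Wraparound-aware "a <= b": normal xids compare modulo 2^32. */
static bool
xid_precedes_or_equals(Xid a, Xid b)
{
    if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
        return a <= b;          /* special xids are numerically smallest */
    return (int32_t) (a - b) <= 0;
}

/*
 * Assumed rule: a replayed record threatens a logical slot only if it
 * touches a catalog relation and the slot's catalog_xmin has not yet
 * moved past the record's conflict horizon.
 */
static bool
slot_conflicts_with_horizon(Xid catalog_xmin, Xid horizon,
                            bool is_catalog_rel)
{
    if (!is_catalog_rel)
        return false;           /* user-table cleanup never hits the slot */
    if (catalog_xmin == INVALID_XID)
        return false;           /* slot holds no catalog horizon */
    return xid_precedes_or_equals(catalog_xmin, horizon);
}
```

For example, a slot with catalog_xmin 100 conflicts with a catalog
prune record whose horizon is 150, while the same record on a
non-catalog relation leaves the slot alone.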

The remaining logical-slot invalidation causes are not fundamental to
archive-fed decoding and are already in the DBA's hands:
RS_INVAL_WAL_LEVEL (set wal_level=logical on the primary) and
RS_INVAL_WAL_REMOVED (retain enough WAL on the standby, e.g. via
max_slot_wal_keep_size). Both are pre-existing knobs an operator
already manages.

Please correct us if there is a second fundamental obstacle we have
missed -- that is one of the main things we would like to confirm.

The proposed approach
---------------------

Instead of supplying the missing feedback, absorb the conflict on the
consumer side. A new GUC, recovery_pause_on_logical_slot_conflict
(default off), changes what happens at that choke point: when replay is
about to invalidate an active logical slot, recovery pauses instead.
Recovery resumes as soon as no slot still blocks the conflict. In the
common case that happens the moment the consumer's decoding advances the
slot's catalog_xmin past the conflict horizon, which can be well before
the consumer reaches the pause LSN; only a slot that is still holding
catalog_xmin back (e.g. a long-running decoded transaction) has to be
drained all the way to the pause LSN. For any slot that did drain to the
pause LSN, recovery advances its catalog_xmin past the horizon so the
following InvalidateObsoleteReplicationSlots() is a no-op; replay then
continues to the next conflict.
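
The "advance drained slots" step can be sketched as follows. This is a
simplified standalone model, not the code in the attached commits:
SlotState, xid_next() and advance_drained_slots() are hypothetical
names, and "past the horizon" is assumed to mean the next normal xid
after the horizon.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;           /* stand-in for TransactionId */
typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */
#define FIRST_NORMAL_XID ((Xid) 3)

/* Hypothetical, reduced view of a logical slot. */
typedef struct SlotState
{
    Xid catalog_xmin;           /* oldest catalog xid the slot needs */
    Lsn confirmed_flush;        /* how far the consumer has confirmed */
} SlotState;

/* Wraparound-aware "a <= b"; assumes both are normal xids. */
static bool
xid_precedes_or_equals(Xid a, Xid b)
{
    return (int32_t) (a - b) <= 0;
}

/* Next xid after x, skipping the special values 0..2 on wraparound. */
static Xid
xid_next(Xid x)
{
    x++;
    if (x < FIRST_NORMAL_XID)
        x = FIRST_NORMAL_XID;
    return x;
}

/*
 * For every slot that has drained to pause_lsn but still conflicts,
 * move catalog_xmin just past the horizon so a later invalidation
 * check finds nothing to do. Returns the number of slots advanced.
 */
static size_t
advance_drained_slots(SlotState *slots, size_t n, Xid horizon,
                      Lsn pause_lsn)
{
    size_t advanced = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (slots[i].confirmed_flush >= pause_lsn &&
            xid_precedes_or_equals(slots[i].catalog_xmin, horizon))
        {
            slots[i].catalog_xmin = xid_next(horizon);
            advanced++;
        }
    }
    return advanced;
}
```

A slot that has not reached the pause LSN is left untouched in this
model, so it keeps blocking until its consumer catches up.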

* The hot path when the GUC is off is a single boolean early-return.
* It reuses the existing SetRecoveryPause / recoveryNotPausedCV
machinery; no new shared memory.
* Auto-resume: a periodic re-scan lets replay continue the moment
nothing blocks the conflict. A slot stops blocking when its
catalog_xmin advances past the conflict horizon (the normal path,
via the consumer decoding/confirming, often before the pause LSN),
or when it drains past the pause LSN, is dropped, advanced, or
invalidated for another reason. This lets it run as an unattended
service. pg_wal_replay_resume() remains a manual "give up on this
slot and let it invalidate" escape hatch, and pg_promote() still
breaks out via CheckForStandbyTrigger().
* Crash safety: after advancing catalog_xmin in memory, dirty slots
are flushed with CheckPointReplicationSlots(false) before replay
proceeds, upholding the write-before-memory-update invariant that
LogicalConfirmReceivedLocation already relies on.
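
The auto-resume condition in the bullets above can be modelled as a
predicate over the slots. The sketch below is a standalone
illustration under stated assumptions, not the patch itself: SlotModel,
slot_blocks_conflict() and recovery_may_resume() are hypothetical
names, and "drained" is assumed to mean confirmed_flush >= pause LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;           /* stand-in for TransactionId */
typedef uint64_t Lsn;           /* stand-in for XLogRecPtr */

/* Hypothetical, reduced view of a logical slot. */
typedef struct SlotModel
{
    bool in_use;                /* false once the slot is dropped */
    bool invalidated;           /* invalidated for any other reason */
    Xid  catalog_xmin;
    Lsn  confirmed_flush;
} SlotModel;

/* Wraparound-aware "a <= b"; assumes both are normal xids. */
static bool
xid_precedes_or_equals(Xid a, Xid b)
{
    return (int32_t) (a - b) <= 0;
}

/* Does this slot still hold replay at the paused conflict? */
static bool
slot_blocks_conflict(const SlotModel *s, Xid horizon, Lsn pause_lsn)
{
    if (!s->in_use || s->invalidated)
        return false;           /* dropped or already invalidated */
    if (!xid_precedes_or_equals(s->catalog_xmin, horizon))
        return false;           /* catalog_xmin is past the horizon */
    if (s->confirmed_flush >= pause_lsn)
        return false;           /* consumer drained to the pause LSN */
    return true;
}

/* The periodic re-scan: replay may continue once no slot blocks. */
static bool
recovery_may_resume(const SlotModel *slots, size_t n, Xid horizon,
                    Lsn pause_lsn)
{
    for (size_t i = 0; i < n; i++)
    {
        if (slot_blocks_conflict(&slots[i], horizon, pause_lsn))
            return false;
    }
    return true;
}
```

Calling recovery_may_resume() on each re-scan gives the unattended
behaviour described above: any one of the listed events on the last
blocking slot flips the result to true.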

How we test it
--------------

The in-tree TAP test (054_recovery_pause_on_slot_conflict.pl) builds a
workload designed to break a logical slot, and checks that it breaks
without the feature and survives with it:

1. Bring up an archive-only standby from a basebackup whose archive
contains a standby snapshot but no catalog-prune WAL yet, and
create a logical slot while it is still consistent.
2. Churn the primary's catalog (transient tables, ANALYZE, VACUUM of
pg_class / pg_attribute / pg_statistic, etc.) so the archive then
carries catalog-prune records whose horizon overtakes the slot.
3. Run two standbys from the same archive: with the GUC off the slot
is invalidated (the upstream behaviour, and a check that the test
actually reproduces the conflict); with the GUC on a drain-and-
resume loop keeps the slot alive and decodes the full change
stream.

A third standby checks that an explicit operator pg_wal_replay_pause()
is not cleared by the GUC's auto-resume.

We also ran an end-to-end field test outside core, on a real
archive-only standby recovering from object storage via WAL-G, with a
pgbench workload plus deliberate catalog churn on the primary and a
pg_recvlogical consumer on the standby. The consumer was forced to lag
so the slot fell behind the prune horizon; recovery paused, the
consumer caught up, recovery auto-resumed, and the full change stream
arrived with no gaps -- while a GUC-off control standby lost its slot.
The design and results are written up in [0].

Questions for the list
----------------------

1. Direction. Is "pause recovery until the consumer catches up" an
acceptable shape for this, or is the right long-term answer a
feedback mechanism that does not require a streaming connection
(e.g. an out-of-band way to publish a decoding standby's catalog_xmin
back to the primary)? Pausing trades standby freshness for slot
survival, which is fine for a dedicated decoding replica but bad for
one that is also an HA target.

2. API. A cluster-wide GUC that stalls all of replay for the benefit
of one slot is coarse. Would a per-slot property be cleaner --
e.g. a slot option that opts that slot into "hold recovery rather
than invalidate me", so unrelated standbys and slots are unaffected?
That also makes the backpressure explicit: a slow consumer on one
slot deliberately holds the standby back.

3. Writing slot fields from the startup process. Advancing
catalog_xmin during recovery, and flushing slots from the startup
process, is new. Does the crash-safety argument above hold up,
and are there concurrency concerns beyond the synced-slot and
in-progress-snapbuild cases the commits already skip?

4. Fit with existing CDC installations. From a consumer's point of
view (pg_recvlogical, Debezium, etc.) this looks like ordinary
streaming from a standby with occasional stalls, so it should drop
into existing pipelines. Is that the right integration point, or
would operators prefer a separate "decoding from archive" tool that
never runs a full standby at all?

We are most interested in (1) and (2): whether this is the right layer
to solve the problem, and whether the interface can be made narrower
and less surprising than a global recovery pause.

Best regards, Andrey, Kirk, Nik.

[0] https://github.com/NikolayS/postgres/issues/43

Attachments:

* v1-0001-xlogrecovery-make-ConfirmRecoveryPaused-and-Check.patch
(application/octet-stream, 3.5 KB)
* v1-0002-Add-recovery_pause_on_logical_slot_conflict-GUC.patch
(application/octet-stream, 31.8 KB)
* v1-0003-Auto-resume-recovery-once-the-logical-slot-confli.patch
(application/octet-stream, 26.5 KB)
