Re: Separate catalog_xmin from xmin in walsender hot standby feedback

From: Rui Zhao <zhaorui126(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, andres(at)anarazel(dot)de
Subject: Re: Separate catalog_xmin from xmin in walsender hot standby feedback
Date: 2026-06-15 02:48:38
Message-ID: 57E2263F-455D-49A5-A16D-E66562DBC027@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Amit,

Thanks for the careful review. Following your suggestion to explore the
ephemeral-slot direction hinted at in the XXX comment in walsender.c
(ProcessStandbyHSFeedbackMessage), I built a working prototype and ran it
through the design questions you raised. Sharing what I found, since it
changes my recommendation.

== The prototype ==

When a standby connects WITHOUT a replication slot and sends hot standby
feedback, the walsender now lazily creates an *ephemeral* physical slot
(RS_EPHEMERAL, named pg_walsender_<pid>) on first feedback, and tracks the
two horizons through it:

if (MyReplicationSlot == NULL)
ReplicationSlotCreate("pg_walsender_<pid>", false, RS_EPHEMERAL,
false, false, false, false,
/* error_if_full */ false);

if (MyReplicationSlot != NULL)
PhysicalReplicationSlotNewXmin(feedbackXmin, feedbackCatalogXmin);
else
/* graceful degradation -- see below */

This directly addresses both of your concerns:

(a) Per-PGPROC memory: nothing is added to PGPROC. The 4 bytes/proc that
the previous patch spent on every backend are gone; only walsenders
that actually receive feedback consume anything, and they consume a
slot they were arguably entitled to all along.

(b) Torn read between xmin and catalog_xmin: PhysicalReplicationSlotNewXmin
updates effective_xmin and effective_catalog_xmin together under
slot->mutex, so ComputeXidHorizons never observes a half-updated pair.
The atomicity falls out of the existing slot machinery for free.

== WAL retention ==

The obvious worry with a slot is restart_lsn pinning WAL. In practice this
is bounded and, I'd argue, desirable:

- The slot is ephemeral, so it is dropped automatically when the walsender
exits (before_shmem_exit -> ReplicationSlotShmemExit ->
ReplicationSlotRelease), including on standby disconnect/crash and on
crash-restart. There is no persistent slot left behind to pile up WAL
indefinitely -- which was the historical reason this path avoided slots.

- While the standby is connected, restart_lsn keeps the WAL the standby
still needs. That is precisely the behavior we want: today the no-slot
path can let the primary recycle WAL out from under a perfectly healthy
standby. Pinning it is a feature, not a regression.

- The worst case (connected but hopelessly lagging) is bounded by
max_slot_wal_keep_size, and a dead-but-unnoticed standby is bounded by
wal_sender_timeout (default 60s) before the walsender exits and the slot
drops.

== Slot-pool exhaustion ==

The one real pitfall is that ephemeral slots consume max_replication_slots.
If the pool is exhausted, ReplicationSlotCreate would ereport(ERROR), which
in the walsender would tear down the replication connection -- a regression
for feedback-only standbys that work fine today.

I solved this by adding an error_if_full flag to ReplicationSlotCreate
(default-equivalent true for all existing callers; behavior unchanged). When
false, a full pool releases ReplicationSlotAllocationLock and returns false
instead of erroring, and the walsender degrades gracefully to the pre-existing
behavior -- holding back the older of the two horizons via its own PGPROC
xmin:

/* slot pool exhausted: never break the connection */
MyProc->xmin = min(feedbackXmin, feedbackCatalogXmin);

So the separation is best-effort: you get the catalog/data split whenever a
slot is available, and you fall back to exactly today's behavior when it is
not. No connection is ever broken by this change.

== Visibility ==

One user-facing consequence worth flagging: each feedback standby now surfaces
a pg_walsender_<pid> physical slot in pg_replication_slots while it is
connected (it disappears when the walsender exits and the ephemeral slot is
dropped). Note the temporary column reads 'f' there, since the slot is
RS_EPHEMERAL rather than RS_TEMPORARY, so to a casual reader it looks like an
ordinary persistent physical slot distinguished only by its name prefix.

This is harmless functionally, but monitoring that counts slots or alerts on
"unexpected" slots will see these appear, and the 'f' in temporary may
mislead someone into thinking they won't be cleaned up automatically (they
are). The PGPROC approach has no such surface -- the horizon is held silently
on the walsender's proc entry. So this is a genuine trade-off between the two:
visibility/observability (you can actually see what each standby is pinning)
versus invisibility (nothing new shows up anywhere). I think surfacing it is
mostly a feature, but I wanted to call it out explicitly rather than have it
surprise anyone.

== Validation ==

I tested this on a live primary/standby pair (rebased onto current master).
A standby connects with no slot and hot_standby_feedback = on, and a logical
slot on the standby drives a catalog_xmin distinct from the data xmin. On the
primary, pg_replication_slots then shows:

slot_name | pg_walsender_3908747
slot_type | physical
xmin | 696 <- data horizon
catalog_xmin | 695 <- catalog horizon

So the two horizons are tracked separately, exactly as intended. Before this
change the slotless path would have collapsed them into MyProc->xmin =
min(696, 695) = 695, needlessly holding the data-table vacuum horizon back by
one. With the patch the data horizon stays at 696 and only catalog vacuum is
held at 695.

I also confirmed the lifecycle: the ephemeral slot is created lazily on first
feedback, and it is dropped automatically when the standby disconnects (the
slot count on the primary returns to zero), so there is no WAL pile-up left
behind.

== Recommendation ==

Given the above, I lean toward the ephemeral-slot approach over the PGPROC
approach: it removes the per-proc cost, gets atomicity for free, and reuses
existing, well-understood slot machinery instead of introducing a parallel
horizon-tracking path. The new surface is small: the error_if_full flag, plus
the visibility trade-off noted above (a pg_walsender_<pid> slot per connected
feedback standby, versus the PGPROC approach holding the horizon silently).

The attached prototype compiles cleanly and passes the live test above on
current master. Before I post a formal patch I wanted your read on the
direction -- in particular whether consuming (and exposing) a
max_replication_slots entry per feedback standby is acceptable given the
graceful fallback, or whether you'd prefer the pool accounting handled
differently.

Thanks again,
Rui

Attachment Content-Type Size
v1-0001-Separate-catalog_xmin-from-xmin-via-ephemeral-phy.patch application/octet-stream 8.9 KB
unknown_filename text/plain 2.9 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2026-06-15 02:49:35 Re: [PATCH v1] Fix typo in InitWalRecovery() comment
Previous Message Michael Paquier 2026-06-15 02:46:36 Re: Use \if/\endif to remove non-libxml2 expected output in regression tests