Re: Introduce XID age based replication slot invalidation

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, John H <johnhyvr(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Introduce XID age based replication slot invalidation
Date: 2026-03-31 00:13:08
Message-ID: CAD21AoB4MsTpG5JEkifght_tQ91VHJO_8kpsDCrG-z9qkkmN5g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 29, 2026 at 6:35 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Sun, Mar 29, 2026 at 1:16 PM Srinath Reddy Sadipiralla
> <srinath2133(at)gmail(dot)com> wrote:
> >
> > Hello,
> >
> > Thanks for the v5 patch set, I have reviewed and did initial testing on
> > v5 patch set, and it LGTM, except these
>
> Thank you for reviewing and testing. I appreciate it.
>
> > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > index 286f0f46341..c2ff7e464f0 100644
> > --- a/src/backend/replication/slot.c
> > +++ b/src/backend/replication/slot.c
> > @@ -1849,7 +1849,7 @@ ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
> > else
> > {
> > /* translator: %s is a GUC variable name */
> > - appendStringInfo(&err_detail, _("The slot's xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
> > + appendStringInfo(&err_detail, _("The slot's catalog_xmin %u is %d transactions old, which exceeds the configured \"%s\" value of %d."),
> > catalog_xmin, (int32) (recentXid - catalog_xmin), "max_slot_xid_age", max_slot_xid_age);
> > }
>
> Fixed the typo.
>
> > while testing the active slot XID age invalidation (SIGTERM path) , i
> > observed that slot got invalidated , walsender was killed because of
> > SIGTERM , then starts the infinite-retry-cycle problem where
> > walreceiver starts walsender and walsender will try to use an invalidated
> > slot and dies, will think more on this.
>
> I would like to clarify that once a slot is invalidated due to any of
> the reasons (ReplicationSlotInvalidationCause), it becomes unusable;
> the sender will error out if the receiver tries to use it. This is
> consistent with all existing slot invalidation mechanisms.
>
> Please find the attached v6 patches fixing the typo for further review.
>

I've reviewed the v6 patch. Here are some comments.

bool
vacuum_get_cutoffs(Relation rel, const VacuumParams params,
- struct VacuumCutoffs *cutoffs)
+ struct VacuumCutoffs *cutoffs,
+ TransactionId *slot_xmin,
+ TransactionId *slot_catalog_xmin)

How about storing both slot_xmin and catalog_xmin into VacuumCutoffs?

---
- if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED |
RS_INVAL_IDLE_TIMEOUT,
+ possibleInvalidationCauses = RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT |
+ RS_INVAL_XID_AGE;
+
+ if (InvalidateObsoleteReplicationSlots(possibleInvalidationCauses,
_logSegNo, InvalidOid,
+ InvalidTransactionId,
+ max_slot_xid_age > 0 ?
+ ReadNextTransactionId() :
InvalidTransactionId))

It's odd to me that we specify RS_INVAL_XID_AGE while passing
InvalidTransactionId. I think we can specify RS_INVAL_XID_AGE along
with a valid recentXId only when we'd like to check the slots based on
their XIDs.

---
+ /* Check if the slot needs to be invalidated due to max_slot_xid_age GUC */
+ if ((possible_causes & RS_INVAL_XID_AGE) && CanInvalidateXidAgedSlot(s))
+ {
+ TransactionId xidLimit;
+
+ Assert(TransactionIdIsValid(recentXid));
+
+ xidLimit = TransactionIdRetreatedBy(recentXid, max_slot_xid_age);
+

I think we can avoid calculating xidLimit for every slot by
calculating it in InvalidatePossiblyObsoleteSlot() and passing it to
DetermineSlotInvalidationCause().

---
*/
TransactionId
GetOldestNonRemovableTransactionId(Relation rel)
+{
+ return GetOldestNonRemovableTransactionIdExt(rel, NULL, NULL);
+}
+
+/*
+ * Same as GetOldestNonRemovableTransactionId(), but also returns the
+ * replication slot xmin and catalog_xmin from the same ComputeXidHorizons()
+ * call. This avoids a separate ProcArrayLock acquisition when the caller
+ * needs both values.
+ */
+TransactionId
+GetOldestNonRemovableTransactionIdExt(Relation rel,
+ TransactionId *slot_xmin,
+ TransactionId *slot_catalog_xmin)
{

I understand that the primary reason why the patch introduces another
variant of GetOldestNonRemovableTransactionId() is to avoid extra
ProcArrayLock acquision to get replication slot xmin and catalog_xmin.
While it's not very elegant, I find that it would not be bad because
otherwise autovacuum takes extra ProcArrayLock (in shared mode) for
every table to vacuum. The ProcArrayLock is already known
high-contented lock it would be better to avoid taking it once more.
If others think differently, we can just call
ProcArrayGetReplicationSlotXmin() separately and compare them to the
limit of XID-age based slot invalidation.

Having said that, I personally don't want to add new instructions to
the existing GetOldestNonRemovableTransactionId(). I guess we might
want to make both the existing function and new function call a common
(inline) function that takes ComputeXidHorizonsResult and returns
appropriate transaction id based on the given relation .

---
+ # Do some work to advance xids
+ $node->safe_psql(
+ 'postgres', qq[
+ do \$\$
+ begin
+ for i in 1..$nxids loop
+ -- use an exception block so that each iteration eats an XID
+ begin
+ insert into $table_name values (i);
+ exception
+ when division_by_zero then null;
+ end;
+ end loop;
+ end\$\$;
+ ]);

I think it's fater to use pg_current_xact_id() instead.

---
+ else
+ {
+ $node->safe_psql('postgres', "VACUUM");
+ }

We don't need to vacuum all tables here.

---
+# Configure primary with XID age settings. Set autovacuum_naptime high so
+# that the checkpointer (not vacuum) triggers the invalidation.
+my $max_slot_xid_age = 500;
+$primary5->append_conf(
+ 'postgresql.conf', qq{
+max_slot_xid_age = $max_slot_xid_age
+autovacuum_naptime = '1h'
+});

I think that it's better to disable autovacuum than setting a large number.

---
+# Testcase end: Invalidate streaming standby's slot due to max_slot_xid_age
+# GUC (via checkpoint).

I think that we can say "physical slot" instead of standby's slot to
avoid confusion as I thought standby's slot is a slot created on the
standby at the first glance.

---
Do we have tests for invalidating slots on the standbys?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message SATYANARAYANA NARLAPURAM 2026-03-31 00:14:39 Re: POC: Parallel processing of indexes in autovacuum
Previous Message Jacob Champion 2026-03-30 23:54:18 Re: Custom oauth validator options