Introduce XID age based replication slot invalidation

From: John H <johnhyvr(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Introduce XID age based replication slot invalidation
Date: 2025-09-18 17:20:22
Message-ID: CA+-JvFsMHckBMzsu5Ov9HCG3AFbMh056hHy1FiXazBRtZ9pFBg@mail.gmail.com
Lists: pgsql-hackers

Hi folks,

I'd like to restart the discussion about providing an XID-based slot
invalidation mechanism. The previous effort [1] proposed both XID-based
and inactive-time-based invalidation, and the time-based approach was
implemented first. The latest XID-based patch from Bharath Rupireddy
can be found here [2].

When thinking about availability of the database, inactive replication
slots cause two main pain points:
1) WAL accumulation
2) Replication slots with xmin/catalog_xmin can hold back vacuuming,
leading to wraparound

The first issue can be mitigated by 'max_slot_wal_keep_size'. In the
second case, however, there is no good mechanism to prioritize write
availability of the database and avoid wraparound. The new GUC
'idle_replication_slot_timeout' partially addresses the concern when
workloads are similar, but it's hard to pick one value that works
across a fleet of different applications.

It's easy to imagine one cluster with a workload that consumes XIDs
rapidly while another runs large batch jobs whose changes get synced
out periodically. There isn't a one-size-fits-all setting for
'idle_replication_slot_timeout' in these two cases.

The attached patch addresses this by introducing 'max_slot_xid_age' in
a similar fashion. Replication slots whose xmin or catalog_xmin is older
than the configured age get invalidated, allowing vacuum to proceed and
biasing towards database availability.
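
To make the intended check concrete, here is a rough sketch in Python.
It is not the patch's code: the function names are made up for
illustration, treating a setting of 0 as "disabled" is an assumption,
and it ignores the special reserved XIDs. It only shows the idea of
comparing a slot's xmin/catalog_xmin age, computed modulo 2^32, against
'max_slot_xid_age'.

```python
from typing import Optional

XID_MODULUS = 2**32  # transaction IDs are 32-bit counters that wrap around


def xid_age(next_xid: int, xid: int) -> int:
    """Number of XIDs from `xid` forward to `next_xid`, modulo 2^32."""
    return (next_xid - xid) % XID_MODULUS


def slot_exceeds_xid_age(
    next_xid: int,
    slot_xmin: Optional[int],
    slot_catalog_xmin: Optional[int],
    max_slot_xid_age: int,
) -> bool:
    """Hypothetical check: True if either XID held by the slot is too old.

    Assumption: max_slot_xid_age <= 0 disables the check, and None means
    the slot does not hold that particular XID.
    """
    if max_slot_xid_age <= 0:
        return False
    for xid in (slot_xmin, slot_catalog_xmin):
        if xid is not None and xid_age(next_xid, xid) > max_slot_xid_age:
            return True
    return False
```

For example, with next_xid = 5 just after a wraparound and a slot xmin
of 2^32 - 10, the age is 15, so the slot would be flagged with
'max_slot_xid_age' set to 10 but not with it set to 20.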

Invalidation happens during CHECKPOINT, as with
'idle_replication_slot_timeout', and also when VACUUM runs.

The patch currently attempts invalidation once per autovacuum worker.
We're wondering if it should instead attempt invalidation on a
per-relation basis within the vacuum call itself. That would account
for scenarios where cost_delay or naptime is high and a long time
passes between autovacuum executions.
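
A toy model of the difference, again in Python and purely
illustrative: the function name and the numbers are hypothetical, XID
wraparound is ignored for brevity, and a "worker" is modelled as a list
of next-XID values observed at the start of each relation it vacuums.

```python
from typing import List, Optional


def first_invalidation_point(
    next_xid_at_relation_start: List[int],
    slot_xmin: int,
    max_slot_xid_age: int,
    per_relation: bool,
) -> Optional[int]:
    """Index of the relation before which the slot would be invalidated.

    per_relation=False models a single check when the worker starts;
    per_relation=True models a check before every relation.
    Returns None if the slot is never invalidated during this worker run.
    """
    for i, next_xid in enumerate(next_xid_at_relation_start):
        check_here = per_relation or i == 0
        if check_here and (next_xid - slot_xmin) > max_slot_xid_age:
            return i
    return None


# XIDs advance while a slow worker (high cost_delay) walks four relations.
xids = [1000, 1500, 2500, 4000]
once_per_worker = first_invalidation_point(xids, 500, 1500, per_relation=False)
every_relation = first_invalidation_point(xids, 500, 1500, per_relation=True)
```

In this example the once-per-worker check never fires during the run
(once_per_worker is None), while the per-relation check invalidates the
slot before the third relation (every_relation is 2).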

Thanks,

John H

[1] https://www.postgresql.org/message-id/flat/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe%2Baw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CALj2ACXe8%2BxSNdMXTMaSRWUwX7v61Ad4iddUwnn%3DdjSwx3GLLg%40mail.gmail.com

--
John Hsu - Amazon Web Services

Attachment Content-Type Size
0044-Add-XID-age-based-replication-slot-invalidation.patch application/octet-stream 23.2 KB