| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
| Cc: | Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, John H <johnhyvr(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Introduce XID age based replication slot invalidation |
| Date: | 2026-04-01 19:38:23 |
| Message-ID: | CAD21AoB61CgA921J2FwXzxfxUsxQzi7GYAZaH05PHmb6HFe_hg@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Tue, Mar 31, 2026 at 10:21 AM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Mon, Mar 30, 2026 at 5:13 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > I've reviewed the v6 patch. Here are some comments.
>
> Thank you for reviewing the patch.
>
> > bool
> > vacuum_get_cutoffs(Relation rel, const VacuumParams params,
> > - struct VacuumCutoffs *cutoffs)
> > + struct VacuumCutoffs *cutoffs,
> > + TransactionId *slot_xmin,
> > + TransactionId *slot_catalog_xmin)
> >
> > How about storing both slot_xmin and catalog_xmin into VacuumCutoffs?
>
> Done.
>
> > ---
> > - if (InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_REMOVED |
> > RS_INVAL_IDLE_TIMEOUT,
> > + possibleInvalidationCauses = RS_INVAL_WAL_REMOVED | RS_INVAL_IDLE_TIMEOUT |
> > + RS_INVAL_XID_AGE;
> > +
> > + if (InvalidateObsoleteReplicationSlots(possibleInvalidationCauses,
> > _logSegNo, InvalidOid,
> > + InvalidTransactionId,
> > + max_slot_xid_age > 0 ?
> > + ReadNextTransactionId() :
> > InvalidTransactionId))
> >
> > It's odd to me that we specify RS_INVAL_XID_AGE while passing
> > InvalidTransactionId. I think we can specify RS_INVAL_XID_AGE along
> > with a valid recentXId only when we'd like to check the slots based on
> > their XIDs.
>
> Done.
>
> > ---
> > + /* Check if the slot needs to be invalidated due to max_slot_xid_age GUC */
> > + if ((possible_causes & RS_INVAL_XID_AGE) && CanInvalidateXidAgedSlot(s))
> > + {
> > + TransactionId xidLimit;
> > +
> > + Assert(TransactionIdIsValid(recentXid));
> > +
> > + xidLimit = TransactionIdRetreatedBy(recentXid, max_slot_xid_age);
> > +
> >
> > I think we can avoid calculating xidLimit for every slot by
> > calculating it in InvalidatePossiblyObsoleteSlot() and passing it to
> > DetermineSlotInvalidationCause().
>
> Done.
>
> > ---
> > */
> > TransactionId
> > GetOldestNonRemovableTransactionId(Relation rel)
> > +{
> > + return GetOldestNonRemovableTransactionIdExt(rel, NULL, NULL);
> > +}
> > +
> > +/*
> > + * Same as GetOldestNonRemovableTransactionId(), but also returns the
> > + * replication slot xmin and catalog_xmin from the same ComputeXidHorizons()
> > + * call. This avoids a separate ProcArrayLock acquisition when the caller
> > + * needs both values.
> > + */
> > +TransactionId
> > +GetOldestNonRemovableTransactionIdExt(Relation rel,
> > + TransactionId *slot_xmin,
> > + TransactionId *slot_catalog_xmin)
> > {
> >
> > I understand that the primary reason why the patch introduces another
> > variant of GetOldestNonRemovableTransactionId() is to avoid extra
> > ProcArrayLock acquision to get replication slot xmin and catalog_xmin.
> > While it's not very elegant, I find that it would not be bad because
> > otherwise autovacuum takes extra ProcArrayLock (in shared mode) for
> > every table to vacuum. The ProcArrayLock is already known
> > high-contented lock it would be better to avoid taking it once more.
> > If others think differently, we can just call
> > ProcArrayGetReplicationSlotXmin() separately and compare them to the
> > limit of XID-age based slot invalidation.
>
> I understand the concerns around the ProcArrayLock and I think a new
> function to return the computed slot's xmin and catalog_xmin is good.
>
> > Having said that, I personally don't want to add new instructions to
> > the existing GetOldestNonRemovableTransactionId(). I guess we might
> > want to make both the existing function and new function call a common
> > (inline) function that takes ComputeXidHorizonsResult and returns
> > appropriate transaction id based on the given relation .
>
> Done.
>
> > ---
> > + # Do some work to advance xids
> > + $node->safe_psql(
> > + 'postgres', qq[
> > + do \$\$
> > + begin
> > + for i in 1..$nxids loop
> > + -- use an exception block so that each iteration eats an XID
> > + begin
> > + insert into $table_name values (i);
> > + exception
> > + when division_by_zero then null;
> > + end;
> > + end loop;
> > + end\$\$;
> > + ]);
> >
> > I think it's fater to use pg_current_xact_id() instead.
>
> Done. I pulled this from an existing test case in 001_stream_rep.pl.
> Used the pg_current_xact_id approach. Testing times stay the same i.e.
> 9 wallclock secs.
>
> > ---
> > + else
> > + {
> > + $node->safe_psql('postgres', "VACUUM");
> > + }
> >
> > We don't need to vacuum all tables here.
>
> Fixed.
>
> > ---
> > +# Configure primary with XID age settings. Set autovacuum_naptime high so
> > +# that the checkpointer (not vacuum) triggers the invalidation.
> > +my $max_slot_xid_age = 500;
> > +$primary5->append_conf(
> > + 'postgresql.conf', qq{
> > +max_slot_xid_age = $max_slot_xid_age
> > +autovacuum_naptime = '1h'
> > +});
> >
> > I think that it's better to disable autovacuum than setting a large number.
>
> Done.
>
> > ---
> > +# Testcase end: Invalidate streaming standby's slot due to max_slot_xid_age
> > +# GUC (via checkpoint).
> >
> > I think that we can say "physical slot" instead of standby's slot to
> > avoid confusion as I thought standby's slot is a slot created on the
> > standby at the first glance.
>
> Fixed.
>
> > ---
> > Do we have tests for invalidating slots on the standbys?
>
> Added a test case for this.
>
> Please find the attached v7 patches for further review. Thank you!
I've reviewed the v7 patch and have some review comments:
+# Advance the given number of XIDs
+sub advance_xids
+{
+ my ($node, $nxids) = @_;
+ my $sql = join(";\n", ("SELECT pg_current_xact_id()") x $nxids);
+ $node->safe_psql('postgres', $sql);
+}
I think we can create a procedure on primary5 instance to consume XIDs
as follow:
$standby5->safe_psql(
'postgres',
qq{CREATE PROCEDURE consume_xid(cnt int)
AS \$\$
DECLARE
i int;
BEGIN
FOR i in 1..cnt LOOP
EXECUTE 'SELECT pg_current_xact_id()';
COMMIT;
END LOOP;
END;
+\$\$
LANGUAGE plpgsql;
});
---
+# Create a subscriber node
+my $subscriber5 = PostgreSQL::Test::Cluster->new('subscriber5');
+$subscriber5->init(allows_streaming => 'logical');
+$subscriber5->start;
Do we really need to create a subscriber for this test? I think we can
simply create a logical slot on the primary5 and test the XID-age
based slot invalidation.
---
I've attached a fixup patch to propose some cleanup and refactoring, including:
- changes to invalidation errdetail message.
- passing xidLimit instead of recentXid to simplify the invalidation logic.
- documentation changes.
- comment changes.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
| Attachment | Content-Type | Size |
|---|---|---|
| fixup_for_v7_masahiko.patch | text/x-patch | 12.7 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Corey Huinker | 2026-04-01 20:07:49 | Re: CAST(... ON DEFAULT) - WIP build on top of Error-Safe User Functions |
| Previous Message | Tom Lane | 2026-04-01 19:27:54 | Re: scale parallel_tuple_cost by tuple width |