Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

From: Joao Foltran <joao(at)foltrandba(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
Date: 2026-03-31 17:25:51
Message-ID: CAF8B20Cvh-pdr37DpN_-n1tjpS8zLQB5JTbPbZzewvww0VOyBA@mail.gmail.com
Lists: pgsql-hackers

Hello all,

I've made a v2 of this patch, turning it into a patchset with guidance
from Fabrizio Mello.

This patchset includes a new feature that self-heals (auto-revalidates)
physical replication slots after they have been invalidated for one of
two reasons: RS_INVAL_WAL_REMOVED or RS_INVAL_IDLE_TIMEOUT.

Requiring a user to manually recreate the slot isn't necessary when the
standby server connected to it recovers on its own using
restore_command. Manual recreation also becomes burdensome when
managing a fleet of clusters, where this kind of problem needs to be
handled automatically.

The patch adds an opt-in mechanism that allows physical slots to be
revalidated in those cases. A new persistent field called
`auto_revalidate` (default false) controls which physical slots are
eligible. When enabled, StartReplication issues a WARNING instead of
an ERROR when acquiring an invalidated physical slot, and
PhysicalConfirmReceivedLocation clears the invalidation atomically
with the restart_lsn update upon the first flush ACK. The revalidation
is persisted to disk immediately so it survives a crash.

Only RS_INVAL_WAL_REMOVED and RS_INVAL_IDLE_TIMEOUT are revalidatable,
via an explicit allowlist in SlotCanBeRevalidated(). Future
invalidation reasons must be added there to become eligible.
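
For readers who want the flow in one place, here is a rough standalone
model of the behavior described above. It is not code from the patches:
every Demo*/demo_* name is made up for illustration, the enum values are
placeholders rather than PostgreSQL's real ones, and the `persisted`
flag only stands in for writing the slot to disk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the slot invalidation causes; values are illustrative. */
typedef enum {
    RS_INVAL_NONE,
    RS_INVAL_WAL_REMOVED,
    RS_INVAL_HORIZON,
    RS_INVAL_WAL_LEVEL,
    RS_INVAL_IDLE_TIMEOUT
} DemoInvalCause;

/* Outcome of trying to start replication on a slot. */
typedef enum { DEMO_OK, DEMO_WARNING, DEMO_ERROR } DemoLevel;

/* Hypothetical, heavily reduced view of a replication slot. */
typedef struct {
    bool           is_physical;
    bool           auto_revalidate; /* opt-in flag, default false */
    DemoInvalCause invalidated;
    uint64_t       restart_lsn;
    bool           persisted;       /* stands in for saving to disk */
} DemoSlot;

/* Explicit allowlist: a new cause is ineligible until listed here. */
bool demo_cause_is_revalidatable(DemoInvalCause cause)
{
    switch (cause) {
        case RS_INVAL_WAL_REMOVED:
        case RS_INVAL_IDLE_TIMEOUT:
            return true;
        default:
            return false;
    }
}

/* Assumed eligibility rule: physical, opted in, allowlisted cause. */
bool demo_slot_can_be_revalidated(const DemoSlot *slot)
{
    return slot->is_physical
        && slot->auto_revalidate
        && demo_cause_is_revalidatable(slot->invalidated);
}

/* Acquire path: WARNING for eligible invalidated slots, else ERROR. */
DemoLevel demo_start_replication(const DemoSlot *slot)
{
    if (slot->invalidated == RS_INVAL_NONE)
        return DEMO_OK;
    return demo_slot_can_be_revalidated(slot) ? DEMO_WARNING : DEMO_ERROR;
}

/*
 * Flush ACK path: advance restart_lsn and, for an eligible slot, clear
 * the invalidation in the same step. Returns true if revalidated.
 */
bool demo_confirm_flush(DemoSlot *slot, uint64_t flushed_lsn)
{
    if (slot->invalidated == RS_INVAL_NONE) {
        slot->restart_lsn = flushed_lsn;
        return false;
    }
    if (!demo_slot_can_be_revalidated(slot))
        return false;               /* stays invalidated, untouched */

    slot->restart_lsn = flushed_lsn;
    slot->invalidated = RS_INVAL_NONE;
    slot->persisted = true;         /* the patch saves to disk here */
    return true;
}
```

In this model, a slot with auto_revalidate = false, or one invalidated
for a cause outside the allowlist, still ends in DEMO_ERROR, which
mirrors the error preservation described for the default setting.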

I appreciate Fabrizio's help reviewing everything and walking me
through my questions.

The series is split into five patches:

0001 - Core infrastructure: SlotCanBeRevalidated helper, SlotIsValid
macro, revalidation logic in walsender.c, SLOT_VERSION bump.
0002 - SQL function: new auto_revalidate parameter on
pg_create_physical_replication_slot(), copy-path propagation via
pg_copy_physical_replication_slot(), regression test.
0003 - View exposure: auto_revalidate column in pg_replication_slots.
0004 - TAP recovery test: six scenarios covering revalidation, WAL
retention, xmin recovery, error preservation for
auto_revalidate=false, slot copy revalidation, and idle_timeout
revalidation (some of these require injection_points).
0005 - Documentation: system-views.sgml and func-admin.sgml.

João Foltran
Linkedin: https://www.linkedin.com/in/joao-foltran-031b9312b

On Thu, Jan 22, 2026 at 4:41 PM Joao Foltran <joao(at)foltrandba(dot)com> wrote:
>
> Hi Amit!
>
> Unless we have hot_standby_feedback = on, xmin would be null on the
> physical replication slot.
>
> But, even if using that parameter, as long as we know that the standby
> already has caught up by using the archived wals then the xmin
> wouldn't matter, since we don't need those rows to be visible anymore.
>
> I've attached a simple patch and test here that revalidates the slot
> after it is lost. It is still missing any filtering besides checking
> if the slot is physical or logical, but we can add filters for
> specific invalidations.
>
> Let me know what you think.
>
> Regards,
> João Foltran
>
> On Wed, Jan 14, 2026 at 8:21 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Jan 6, 2026 at 3:26 AM Joao Foltran <joao(at)foltrandba(dot)com> wrote:
> > >
> > > > The slots could be invalidated due to other reasons like
> > > > RS_INVAL_IDLE_TIMEOUT as well.
> > >
> > > We could just filter which invalidation reasons could be "revalidated"
> > > for only reasons that can be resolved this way.
> > >
> >
> > Can we make the slot valid even the required WAL is made available
> > afterwards? What about the removed rows due to the slot's xmin?
> >
> > --
> > With Regards,
> > Amit Kapila.

Attachment Content-Type Size
v2-0005-Add-documentation-for-auto_revalidate.patch application/x-patch 3.4 KB
v2-0001-Add-auto-revalidation-infrastructure-for-physical.patch application/x-patch 6.7 KB
v2-0003-Expose-auto_revalidate-in-pg_replication_slots-vi.patch application/x-patch 4.5 KB
v2-0002-Add-auto_revalidate-parameter-to-pg_create_physic.patch application/x-patch 7.2 KB
v2-0004-Add-TAP-test-for-physical-replication-slot-auto-r.patch application/x-patch 16.5 KB
