| From: | Joao Foltran <joao(at)foltrandba(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation |
| Date: | 2026-03-31 17:25:51 |
| Message-ID: | CAF8B20Cvh-pdr37DpN_-n1tjpS8zLQB5JTbPbZzewvww0VOyBA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hello all,
I've made a v2 of this patch, turning it into a patchset with guidance
from Fabrizio Mello.
This patchset includes a new feature that self-heals (automatically
revalidates) physical replication slots after they have been
invalidated for one of two reasons: RS_INVAL_WAL_REMOVED or
RS_INVAL_IDLE_TIMEOUT.
Requiring a user to manually recreate the slot isn't necessary in cases
where the standby server connected to the slot recovers on its own
using restore_command, and it becomes burdensome when managing a fleet
of clusters, where the scale of the operation means this kind of
problem needs to be handled automatically.
The patch adds an opt-in mechanism that allows physical slots to be
revalidated in those cases. A new persistent field called
`auto_revalidate` (default false) controls which physical slots are
eligible. When enabled, StartReplication issues a WARNING instead of
an ERROR when acquiring physical invalidated slots and
PhysicalConfirmReceivedLocation clears the invalidation atomically
with the restart_lsn update upon the first flush ACK. The revalidation
is persisted to disk immediately so it survives a crash.
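To make the flow above concrete, here is a rough standalone sketch in C. It is not the code from 0001: `DemoSlot`, `demo_start_replication`, `demo_confirm_flush` and `demo_save_slot` are hypothetical stand-ins for the slot struct, StartReplication, PhysicalConfirmReceivedLocation and the on-disk save, and the locking that makes the update atomic in the server is only noted in a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins; these are NOT the real PostgreSQL definitions. */
typedef uint64_t DemoLsn;

typedef struct DemoSlot
{
    bool    auto_revalidate; /* persistent opt-in flag, default false */
    bool    invalidated;     /* true once the slot has been invalidated */
    DemoLsn restart_lsn;     /* oldest WAL position the slot retains */
    int     saves;           /* counts simulated writes of slot state to disk */
} DemoSlot;

/* Hypothetical stand-in for persisting the slot to disk. */
static void
demo_save_slot(DemoSlot *slot)
{
    slot->saves++;
}

/*
 * Slot acquisition when streaming starts. Returns false for the ERROR
 * case (invalidated slot that did not opt in); an opted-in slot only
 * gets a WARNING and streaming may proceed.
 */
static bool
demo_start_replication(const DemoSlot *slot)
{
    assert(slot != NULL);

    if (slot->invalidated)
    {
        if (!slot->auto_revalidate)
        {
            fprintf(stderr, "ERROR: slot is invalidated\n");
            return false;
        }
        fprintf(stderr, "WARNING: slot is invalidated, waiting for revalidation\n");
    }
    return true;
}

/*
 * Flush ACK from the standby. The invalidation is cleared in the same
 * step as the restart_lsn update (assumed to happen under the slot's
 * lock in the server), and a revalidation is saved right away.
 */
static void
demo_confirm_flush(DemoSlot *slot, DemoLsn flushed)
{
    bool revalidated = false;

    assert(slot != NULL);

    if (slot->invalidated && slot->auto_revalidate)
    {
        slot->invalidated = false;
        revalidated = true;
    }
    slot->restart_lsn = flushed;

    if (revalidated)
        demo_save_slot(slot);
}
```

In this sketch a slot created without the opt-in keeps today's behaviour: acquisition fails and nothing is ever cleared.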
Only RS_INVAL_WAL_REMOVED and RS_INVAL_IDLE_TIMEOUT are revalidatable, via
an explicit allowlist in SlotCanBeRevalidated(). Future invalidation
reasons must be added there to become eligible.
I appreciate Fabrizio's help reviewing everything and walking me
through my questions.
The series is split into five patches:
0001 - Core infrastructure: SlotCanBeRevalidated helper, SlotIsValid
macro, revalidation logic in walsender.c, SLOT_VERSION bump.
0002 - SQL function: new auto_revalidate parameter on
pg_create_physical_replication_slot(), copy-path propagation via
pg_copy_physical_replication_slot(), regression test.
0003 - View exposure: auto_revalidate column in pg_replication_slots.
0004 - TAP recovery test: six scenarios covering revalidation, WAL
retention, xmin recovery, error preservation for
auto_revalidate=false, slot copy revalidation, and idle_timeout
revalidation (some of these require injection_points).
0005 - Documentation: system-views.sgml and func-admin.sgml.
João Foltran
Linkedin: https://www.linkedin.com/in/joao-foltran-031b9312b
On Thu, Jan 22, 2026 at 4:41 PM Joao Foltran <joao(at)foltrandba(dot)com> wrote:
>
> Hi Amit!
>
> Unless we have hot_standby_feedback = on, xmin would be null on the
> physical replication slot.
>
> But, even if using that parameter, as long as we know that the standby
> already has caught up by using the archived wals then the xmin
> wouldn't matter, since we don't need those rows to be visible anymore.
>
> I've attached a simple patch and test here that revalidates the slot
> after it is lost. It is still missing any filtering besides checking
> if the slot is physical or logical, but we can add filters for
> specific invalidations.
>
> Let me know what you think.
>
> Regards,
> João Foltran
>
> On Wed, Jan 14, 2026 at 8:21 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Jan 6, 2026 at 3:26 AM Joao Foltran <joao(at)foltrandba(dot)com> wrote:
> > >
> > > > The slots could be invalidated due to other reasons like
> > > > RS_INVAL_IDLE_TIMEOUT as well.
> > >
> > > We could just filter which invalidation reasons could be "revalidated"
> > > for only reasons that can be resolved this way.
> > >
> >
> > Can we make the slot valid even the required WAL is made available
> > afterwards? What about the removed rows due to the slot's xmin?
> >
> > --
> > With Regards,
> > Amit Kapila.
| Attachment | Content-Type | Size |
|---|---|---|
| v2-0005-Add-documentation-for-auto_revalidate.patch | application/x-patch | 3.4 KB |
| v2-0001-Add-auto-revalidation-infrastructure-for-physical.patch | application/x-patch | 6.7 KB |
| v2-0003-Expose-auto_revalidate-in-pg_replication_slots-vi.patch | application/x-patch | 4.5 KB |
| v2-0002-Add-auto_revalidate-parameter-to-pg_create_physic.patch | application/x-patch | 7.2 KB |
| v2-0004-Add-TAP-test-for-physical-replication-slot-auto-r.patch | application/x-patch | 16.5 KB |