Re: Introduce XID age based replication slot invalidation

From: Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, John H <johnhyvr(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Introduce XID age based replication slot invalidation
Date: 2026-04-07 14:39:45
Message-ID: CAFC+b6qOmhz9tEpPNrD5U1XGBWmVSHu6OoKoA-ZuGE=UkOVTEQ@mail.gmail.com
Lists: pgsql-hackers

Hi,

On Mon, Apr 6, 2026 at 11:12 PM Bharath Rupireddy <
bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:

> Hi,
>
> On Mon, Apr 6, 2026 at 1:45 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
> wrote:
> >
> > > I took a look at the v10 patch and it LGTM. I tested it - make
> > > check-world passes, pgindent doesn't complain.
> >
> > While reviewing the patch, I found that with this patch, backend
> > processes and autovacuum workers can simultaneously attempt to
> > invalidate the same slot for the same reason. When invalidating a
> > slot, we send a signal to the process owning the slot and wait for it
> > to exit and release the slot. If the process takes a long time to exit
> > for some reason, subsequent autovacuum workers attempting to
> > invalidate the same slot will also send a SIGTERM and get stuck at
> > InvalidatePossiblyObsoleteSlot(). In the worst case, this could result
> > in all autovacuum activity being blocked. I think we need to address
> > this problem.
>
> Thank you!
>
> You're right that multiple autovacuum workers can wait on the same
> slot for SIGTERM to take effect on the process (mainly walsenders)
> holding the slot. Once the process holding the slot exits, one worker
> finishes the invalidation and the others see it's done and move on.
>
> However, IMHO, this is unlikely to be a problem in practice.
>

I was able to reproduce this by running pg_recvlogical on a slot and
pausing the walsender with a debugger, then doing some hacky stuff around
the GUCs (just to test). In production, IIUC, the walsender can get stuck
for "some" time while decoding a large transaction or during a network
delay, so backends and autovacuum workers stay stuck until then and
resume their work afterwards. Correct me if I am wrong :)

> If needed, we could add a flag to skip extra invalidation attempts
> based on field experience.
>

+1, yeah, this would keep other backends or autovacuum workers from
retrying the same invalidation and getting stuck. Instead, they can check
the flag, be assured that the slot invalidation is being taken care of,
and move on.
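
A minimal sketch of the flag idea, only to illustrate the intent. It is
not taken from the patch: DemoSlot, invalidation_in_progress,
demo_try_begin_invalidation() and demo_finish_invalidation() are all
hypothetical names, and C11 atomics stand in for whatever locking the
real slot code would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a slot; NOT PostgreSQL's ReplicationSlot. */
typedef struct DemoSlot
{
	atomic_bool invalidation_in_progress;	/* hypothetical flag */
	bool		invalidated;
} DemoSlot;

/*
 * Try to claim the invalidation. Returns true for exactly one caller;
 * everyone else gets false and can move on instead of signalling the
 * slot owner again and waiting for it to exit.
 */
static bool
demo_try_begin_invalidation(DemoSlot *slot)
{
	bool		expected = false;

	return atomic_compare_exchange_strong(&slot->invalidation_in_progress,
										  &expected, true);
}

/* Called by the claiming process once the slot owner has released it. */
static void
demo_finish_invalidation(DemoSlot *slot)
{
	/* Only the process that claimed the invalidation should get here. */
	assert(atomic_load(&slot->invalidation_in_progress));

	slot->invalidated = true;
	atomic_store(&slot->invalidation_in_progress, false);
}
```

With something like this, the first process to reach the slot does the
signalling and waiting, and later backends or autovacuum workers skip
the slot as soon as the claim fails.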

--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
