| From: | Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com> |
|---|---|
| To: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> |
| Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, John H <johnhyvr(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Introduce XID age based replication slot invalidation |
| Date: | 2026-04-07 14:39:45 |
| Message-ID: | CAFC+b6qOmhz9tEpPNrD5U1XGBWmVSHu6OoKoA-ZuGE=UkOVTEQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
On Mon, Apr 6, 2026 at 11:12 PM Bharath Rupireddy <
bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> Hi,
>
> On Mon, Apr 6, 2026 at 1:45 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
> wrote:
> >
> > > I took a look at the v10 patch and it LGTM. I tested it - make
> > > check-world passes, pgindent doesn't complain.
> >
> > While reviewing the patch, I found that with this patch, backend
> > processes and autovacuum workers can simultaneously attempt to
> > invalidate the same slot for the same reason. When invalidating a
> > slot, we send a signal to the process owning the slot and wait for it
> > to exit and release the slot. If the process takes a long time to exit
> > for some reason, subsequent autovacuum workers attempting to
> > invalidate the same slot will also send a SIGTERM and get stuck at
> > InvalidatePossiblyObsoleteSlot(). In the worst case, this could result
> > in all autovacuum activity being blocked. I think we need to address
> > this problem.
>
> Thank you!
>
> You're right that multiple autovacuum workers can wait on the same
> slot for SIGTERM to take effect on the process (mainly walsenders)
> holding the slot. Once the process holding the slot exits, one worker
> finishes the invalidation and the others see it's done and move on.
>
> However, IMHO, this is unlikely to be a problem in practice.
>
I was able to reproduce this using pg_recvlogical on a slot, by pausing the
walsender using debugger , then i did some hacky stuff around the GUCs
(just to test), but in production IIUC I think During decoding a large
transaction
or network delay , the walsender gets stuck for "some" time, so backend and
autovacuum workers get stuck until then, after that they resume their work,
Correct me if I am wrong :)
If needed, we could add a flag to skip extra invalidation attempts
> based on field experience.
>
+1, yeah this would help other backends or autovacuum workers not
to retry again the same invalidation and stuck , instead they can check
the flag and be assured that slot invalidation is being taken care of,
so others can move on.
--
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ashutosh Bapat | 2026-04-07 14:46:59 | Re: Better shared data structure management and resizable shared data structures |
| Previous Message | Heikki Linnakangas | 2026-04-07 14:38:36 | Re: Assertion failure in hash_kill_items() |