Re: VM corruption on standby

From:	Kirill Reshke <reshkekirill(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject:	Re: VM corruption on standby
Date:	2025-08-19 18:23:01
Message-ID:	CALdSSPg=dMLayXHMcW8tuLG0aBJSezCh=P6Yo5FJ4rVRR5TzKA@mail.gmail.com
Lists:	pgsql-hackers

On Tue, 19 Aug 2025 at 23:08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Kirill Reshke <reshkekirill(at)gmail(dot)com> writes:
> > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in
> >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS
> >> SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed
> >> (since postmaster will SIGKILL its children).
>
> > This fix was proposed in this thread. It fixes inconsistency but it
> > replaces one set of problems with another set, namely systems that
> > fail to shut down.
>
> I think a bigger objection is that it'd result in two separate
> shutdown behaviors in what's already an extremely under-tested
> (and hard to test) scenario. I don't want to have to deal with
> the ensuing state-space explosion.

Agreed.
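
For readers following along, the rule Yura proposed can be modelled in a few lines of stand-alone C. This is only a sketch and not PostgreSQL source: the enum, the function name, and the plain int parameter are hypothetical stand-ins for CritSectionCount, proc_exit(1) and _exit(2). It shows the two distinct exit paths that the objection above is about.

```c
#include <assert.h>

/* Hypothetical names: a stand-alone model, not PostgreSQL source. */
typedef enum ExitAction
{
    EXIT_VIA_PROC_EXIT,     /* orderly exit: models proc_exit(1) */
    EXIT_IMMEDIATELY        /* no cleanup at all: models _exit(2) */
} ExitAction;

/*
 * Decide how a backend would leave once it notices the postmaster is gone,
 * following the quoted rule: inside a critical section take the immediate
 * path, otherwise take the orderly one.
 *
 * crit_section_count stands in for CritSectionCount (assumed here to be a
 * non-negative nesting depth).
 */
ExitAction
choose_exit_on_postmaster_death(int crit_section_count)
{
    assert(crit_section_count >= 0);

    if (crit_section_count != 0)
        return EXIT_IMMEDIATELY;
    return EXIT_VIA_PROC_EXIT;
}
```

The branch means that whether the process is inside a critical section selects between two different shutdown behaviours, so each would have to be tested separately.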

> I still think that proc_exit(1) is fundamentally the wrong thing
> to do if the postmaster is gone: that code path assumes that
> the cluster is still functional, which is at best shaky.
> I concur though that we'd have to do some more engineering work
> before _exit(2) would be a practical solution.

Agreed.

> In the meantime, it seems like this discussion point arises
> only because the presented test case is doing something that
> seems pretty unsafe, namely invoking WaitEventSet inside a
> critical section.

Agreed.

> We'd probably be best off to get back to the actual bug the
> thread started with, namely whether we aren't doing the wrong
> thing with VM-update order of operations.
>
> regards, tom lane

My understanding is that there is no bug in the VM, at least not in the
test from [0]: it uses an injection point inside the critical section,
which makes the server exit too early.
So the behaviour with and without the injection point is very different.
The corruption we are looking for has to reproducer (see [1]).

[0] https://www.postgresql.org/message-id/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
[1] https://www.postgresql.org/message-id/CALdSSPhGQ1xx10c2NaZgce8qmi%2BSuKFp6T1uWG_aZvPpvoJRkQ%40mail.gmail.com

--
Best regards,
Kirill Reshke

In response to

Re: VM corruption on standby at 2025-08-19 18:08:19 from Tom Lane

Responses

Re: VM corruption on standby at 2025-08-19 18:34:21 from Andrey Borodin
