From: | Kirill Reshke <reshkekirill(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Subject: | Re: VM corruption on standby |
Date: | 2025-08-19 18:23:01 |
Message-ID: | CALdSSPg=dMLayXHMcW8tuLG0aBJSezCh=P6Yo5FJ4rVRR5TzKA@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, 19 Aug 2025 at 23:08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Kirill Reshke <reshkekirill(at)gmail(dot)com> writes:
> > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in
> >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS
> >> SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed
> >> (since postmaster will SIGKILL its children).
>
> > This fix was proposed in this thread. It fixes inconsistency but it
> > replaces one set of problems with another set, namely systems that
> > fail to shut down.
>
> I think a bigger objection is that it'd result in two separate
> shutdown behaviors in what's already an extremely under-tested
> (and hard to test) scenario. I don't want to have to deal with
> the ensuing state-space explosion.
Agreed.
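For readers following along, the proposal quoted above can be modelled roughly as below. This is a standalone sketch, not the actual patch: `ExitAction`, `crit_section_count`, `start_crit_section()`, `end_crit_section()` and `on_postmaster_death()` are hypothetical stand-ins for CritSectionCount, the critical-section macros, and the check that was suggested for WaitEventSetWaitBlock().

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's real definitions. */
typedef enum ExitAction
{
    EXIT_VIA_PROC_EXIT,     /* orderly path: proc_exit(1), exit callbacks run */
    EXIT_IMMEDIATELY        /* abrupt path: _exit(2), no cleanup at all */
} ExitAction;

/* Models CritSectionCount: nesting depth of critical sections. */
static unsigned int crit_section_count = 0;

static void
start_crit_section(void)
{
    crit_section_count++;
}

static void
end_crit_section(void)
{
    assert(crit_section_count > 0);
    crit_section_count--;
}

/*
 * Decision the proposed check would make once postmaster death is noticed:
 * inside a critical section, skip cleanup entirely; otherwise exit normally.
 */
static ExitAction
on_postmaster_death(void)
{
    if (crit_section_count != 0)
        return EXIT_IMMEDIATELY;
    return EXIT_VIA_PROC_EXIT;
}
```

The two possible return values correspond to the two separate shutdown behaviors that the objection above is about.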
> I still think that proc_exit(1) is fundamentally the wrong thing
> to do if the postmaster is gone: that code path assumes that
> the cluster is still functional, which is at best shaky.
> I concur though that we'd have to do some more engineering work
> before _exit(2) would be a practical solution.
Agreed.
> In the meantime, it seems like this discussion point arises
> only because the presented test case is doing something that
> seems pretty unsafe, namely invoking WaitEventSet inside a
> critical section.
Agreed.
> We'd probably be best off to get back to the actual bug the
> thread started with, namely whether we aren't doing the wrong
> thing with VM-update order of operations.
>
> regards, tom lane
My understanding is that there is no bug in the VM, or at least none
shown by the test in [0]: that test uses an injection point inside the
critical section, which makes the server exit too early.
So the behaviour with and without the injection point is very different.
The corruption we are looking for has to reproducer (see [1]).
[0] https://www.postgresql.org/message-id/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
[1] https://www.postgresql.org/message-id/CALdSSPhGQ1xx10c2NaZgce8qmi%2BSuKFp6T1uWG_aZvPpvoJRkQ%40mail.gmail.com
--
Best regards,
Kirill Reshke