From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED |
Date: | 2023-02-18 20:09:00 |
Message-ID: | 20230218200900.g7ejgscb4zlih5o3@awork3.anarazel.de |
Lists: | pgsql-hackers |
Hi,
On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote:
> 18.02.2023 04:06, Andres Freund wrote:
> > On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
> > How can a process that we did notify crashing, that has already executed
> > SQL statements, end up in MarkPostmasterChildActive()?
>
> Maybe it's just the backend started for the money test has got
> the same PID (5948) that the backend for the name test had?
I somehow mashed name and money into one test in my head... So forget what I
wrote.
That doesn't really explain the assertion though.
It's too bad that the test run doesn't include
log_connections/log_disconnections. If nothing else, it makes it a lot easier
to identify problems like that. We actually do try to configure it for CI, but
it currently doesn't work for pg_regress style tests with meson. Need to fix
that. Starting a thread.
One thing that made me very suspicious when reading related code is this
remark:
bool
ReleasePostmasterChildSlot(int slot)
...
/*
* Note: the slot state might already be unused, because the logic in
* postmaster.c is such that this might get called twice when a child
* crashes. So we don't try to Assert anything about the state.
*/
That seems fragile, and potentially racy. What if we somehow end up starting
another backend in between the two ReleasePostmasterChildSlot() calls? Then we
can end up marking a slot that has newly had a process associated with it as
inactive. Once the slot has been released the first time, it can be assigned
again.
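To make the suspected interleaving concrete, here is a toy model in C. It is
only a sketch: the toy_* names and the fixed-size array are invented for
illustration and this is not the pmsignal.c code. It just replays "assign,
release, assign again, release again" against a release function that does not
check the prior state.

```c
#include <stdbool.h>

/* Toy model only: these states loosely mirror the ones discussed in the
 * thread, but none of this is the real pmsignal.c implementation. */
enum { TOY_UNUSED = 0, TOY_ASSIGNED = 1, TOY_ACTIVE = 2 };

#define TOY_NUM_SLOTS 4
static int toy_flags[TOY_NUM_SLOTS];

/* Hand out the first unused slot, or -1 if none is free. */
static int
toy_assign_slot(void)
{
    for (int i = 0; i < TOY_NUM_SLOTS; i++)
    {
        if (toy_flags[i] == TOY_UNUSED)
        {
            toy_flags[i] = TOY_ASSIGNED;
            return i;
        }
    }
    return -1;
}

/* Release without asserting anything about the prior state.
 * Returns whether the slot was in use before the call. */
static bool
toy_release_slot(int slot)
{
    bool was_in_use = (toy_flags[slot] != TOY_UNUSED);

    toy_flags[slot] = TOY_UNUSED;
    return was_in_use;
}

/* Replay the suspected interleaving; return the state child B ends up with. */
static int
toy_replay_double_release(void)
{
    int old_slot = toy_assign_slot();   /* child A starts */

    toy_release_slot(old_slot);         /* first release after A crashes */

    int new_slot = toy_assign_slot();   /* child B gets the same slot */

    toy_release_slot(old_slot);         /* second release, meant for A */

    return toy_flags[new_slot];         /* B would expect TOY_ASSIGNED */
}
```

In this model the second release silently resets the slot that child B now
owns, so a later check that the slot is still assigned would fail for B.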
ISTM that it's not a good idea that we use PM_CHILD_UNUSED to signal both
that a slot has not been used yet and that it's not in use anymore. I think
that makes it quite a bit harder to find state management issues.
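A rough sketch of what separating the two meanings could look like, in C. The
SK_RELEASED state and the sketch_* helpers are hypothetical names made up for
this example; nothing like them exists in the tree as far as this mail is
concerned. The point is only that a redundant release becomes distinguishable
from releasing a slot that was never handed out.

```c
#include <stdbool.h>

/* Hypothetical states: SK_RELEASED is an assumption for illustration,
 * not an existing flag value. */
typedef enum
{
    SK_NEVER_USED,
    SK_ASSIGNED,
    SK_ACTIVE,
    SK_RELEASED
} SketchSlotState;

/* A slot may be handed out if it was never used or has been released. */
static bool
sketch_assign(SketchSlotState *state)
{
    if (*state != SK_NEVER_USED && *state != SK_RELEASED)
        return false;
    *state = SK_ASSIGNED;
    return true;
}

/* Only a slot that is currently held can be released. A second release,
 * or a release of a never-used slot, is reported instead of ignored. */
static bool
sketch_release(SketchSlotState *state)
{
    if (*state != SK_ASSIGNED && *state != SK_ACTIVE)
        return false;
    *state = SK_RELEASED;
    return true;
}
```

Note that this alone would not catch a stale release that arrives after the
slot has already been reassigned; it only makes the "released twice with no
reuse in between" case visible, which is the kind of state management issue
that is hard to spot when both meanings share one value.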
> A simple script that I've found [1] shows that the pids are reused rather often
> (for me, approximately every 300 process starts in Windows 10 H2), but maybe
> under some circumstances (many concurrent processes?) PIDs can coincide
> often enough to trigger that behavior.
It's definitely very aggressive in reusing pids - and it seems to
intentionally do work to keep pids small. I wonder if it'd be worth trying to
exercise this path aggressively by configuring a very low max pid on Linux, in
an EXEC_BACKEND build.
Greetings,
Andres Freund