Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Date: 2023-02-18 20:09:00
Message-ID: 20230218200900.g7ejgscb4zlih5o3@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote:
> 18.02.2023 04:06, Andres Freund wrote:
> > On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
> > How can a process that we did notice crashing, that has already executed
> > SQL statements, end up in MarkPostmasterChildActive()?
>
> Maybe it's just that the backend started for the money test got
> the same PID (5948) that the backend for the name test had?

I somehow mashed name and money into one test in my head... So forget what I
wrote.

That doesn't really explain the assertion though.

It's too bad that the configuration used doesn't include
log_connections/log_disconnections. If nothing else, it makes it a lot easier
to identify problems like that. We actually do try to configure it for CI, but
it currently doesn't work for pg_regress style tests with meson. That needs
fixing; I'm starting a separate thread for it.

One thing that made me very suspicious when reading related code is this
remark:

bool
ReleasePostmasterChildSlot(int slot)
...
    /*
     * Note: the slot state might already be unused, because the logic in
     * postmaster.c is such that this might get called twice when a child
     * crashes.  So we don't try to Assert anything about the state.
     */

That seems fragile, and potentially racy. What if we somehow end up starting
another backend in between the two ReleasePostmasterChildSlot() calls? Then
the second call could mark a slot that now has a new process associated with
it as inactive. Once the slot has been released the first time, it can be
assigned again.
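
To make the suspected interleaving concrete, here is a toy model of the slot
array. It is only a sketch: the names (slot_flags, assign_slot, release_slot,
mark_active) are made up for illustration and are not the pmsignal.c code, and
whether the postmaster can really produce this ordering is an assumption, not
a confirmed diagnosis.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the slot states; not the real definitions. */
enum { SLOT_UNUSED = 0, SLOT_ASSIGNED = 1, SLOT_ACTIVE = 2 };

#define NUM_SLOTS 4
static int slot_flags[NUM_SLOTS];   /* zero-initialized: all SLOT_UNUSED */

/* Hand out the first unused slot (1-based), or 0 if none is free. */
static int
assign_slot(void)
{
    for (int i = 0; i < NUM_SLOTS; i++)
    {
        if (slot_flags[i] == SLOT_UNUSED)
        {
            slot_flags[i] = SLOT_ASSIGNED;
            return i + 1;
        }
    }
    return 0;
}

/* Release without checking the old state, like the quoted comment describes. */
static bool
release_slot(int slot)
{
    assert(slot >= 1 && slot <= NUM_SLOTS);
    bool was_assigned = (slot_flags[slot - 1] == SLOT_ASSIGNED);

    slot_flags[slot - 1] = SLOT_UNUSED;
    return was_assigned;
}

/* Child side: returns false where an assertion on SLOT_ASSIGNED would fail. */
static bool
mark_active(int slot)
{
    assert(slot >= 1 && slot <= NUM_SLOTS);
    if (slot_flags[slot - 1] != SLOT_ASSIGNED)
        return false;
    slot_flags[slot - 1] = SLOT_ACTIVE;
    return true;
}

/* Replay the hypothesized ordering; true means the new child got clobbered. */
static bool
double_release_clobbers_new_child(void)
{
    int old_slot = assign_slot();

    mark_active(old_slot);
    release_slot(old_slot);         /* first release after the crash */

    int new_slot = assign_slot();   /* slot is handed to a new backend */

    release_slot(old_slot);         /* second release, meant for the old child */

    /* The new child now finds its slot unused instead of assigned. */
    return new_slot == old_slot && !mark_active(new_slot);
}
```

In this model the second release silently resets the slot that the new
backend was just given, so its later attempt to mark itself active sees the
wrong state.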

ISTM that it's not a good idea that we use PM_CHILD_ASSIGNED to signal both
that a slot has not been used yet and that it's not in use anymore. I think
that makes it quite a bit harder to find state management issues.
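
As a rough illustration of what separate states could buy us, the sketch
below splits "assigned, child not started yet" from "child was active and has
stopped" and checks transitions explicitly. The enum, its member names and
the allowed transitions are all hypothetical, chosen only to show the idea.

```c
#include <stdbool.h>

/* Hypothetical state set; not the existing PM_CHILD_* values. */
typedef enum
{
    DEMO_UNUSED,        /* slot is free */
    DEMO_ASSIGNED,      /* handed out, child has not marked itself active */
    DEMO_ACTIVE,        /* child is running */
    DEMO_INACTIVE       /* child was active and has shut down */
} DemoSlotState;

/* Return true if moving a slot from 'from' to 'to' is a legal step. */
static bool
demo_transition_ok(DemoSlotState from, DemoSlotState to)
{
    switch (from)
    {
        case DEMO_UNUSED:
            return to == DEMO_ASSIGNED;
        case DEMO_ASSIGNED:
            /* child starts, or never got going and the slot is released */
            return to == DEMO_ACTIVE || to == DEMO_UNUSED;
        case DEMO_ACTIVE:
            /* orderly shutdown, or crash followed by release */
            return to == DEMO_INACTIVE || to == DEMO_UNUSED;
        case DEMO_INACTIVE:
            return to == DEMO_UNUSED;
    }
    return false;
}
```

With a check like this, releasing an already-unused slot or re-activating a
slot that was already shut down would show up as an illegal transition
instead of passing silently.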

> A simple script that I've found [1] shows that PIDs are reused rather often
> (for me, approximately every 300 process starts on Windows 10 H2), but maybe
> under some circumstances (many concurrent processes?) PIDs can coincide
> often enough to trigger that behavior.

Windows is definitely very aggressive in reusing pids, and it seems to
intentionally do work to keep pids small. I wonder if it'd be worth trying to
exercise this path aggressively by configuring a very low max pid on Linux, in
an EXEC_BACKEND build.

Greetings,

Andres Freund
