Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Date: 2023-02-18 20:09:00
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote:
> 18.02.2023 04:06, Andres Freund wrote:
> > On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
> > How can a process that we did notify crashing, that has already executed
> > SQL statements, end up in MarkPostmasterChildActive()?
> Maybe it's just the backend started for the money test has got
> the same PID (5948) that the backend for the name test had?

I somehow mashed name and money into one test in my head... So forget what I

That doesn't really explain the assertion though.

It's too bad that we didn't use doesn't include
log_connections/log_disconnections. If nothing else, it makes it a lot easier
to identify problems like that. We actually do try to configure it for CI, but
it currently doesn't work for pg_regress style tests with meson. Need to fix
that. Starting a thread.

One thing that made me very suspicious when reading related code is this

ReleasePostmasterChildSlot(int slot)
* Note: the slot state might already be unused, because the logic in
* postmaster.c is such that this might get called twice when a child
* crashes. So we don't try to Assert anything about the state.

That seems fragile, and potentially racy. What if we somehow can end up
starting another backend inbetween the two ReleasePostmasterChildSlot() calls,
we can end up marking a slot that, newly, has a process associated with it, as
inactive? Once the slot has been released the first time, it can be assigned

ISTM that it's not a good idea that we use PM_CHILD_ASSIGNED to signal both,
that a slot has not been used yet, and that it's not in use anymore. I think
that makes it quite a bit harder to find state management issues.

> A simple script that I've found [1] shows that the pids reused rather often
> (for me, approximately each 300 process starts in Windows 10 H2), buy maybe
> under some circumstances (many concurrent processes?) PIDs can coincide even
> so often to trigger that behavior.

It's definitely very aggressive in reusing pids - and it seems to
intentionally do work to keep pids small. I wonder if it'd be worth trying to
exercise this path aggressively by configuring a very low max pid on linux, in
an EXEC_BACKEND build.


Andres Freund

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2023-02-18 20:26:11 Handle TEMP_CONFIG for pg_regress style tests in pg_regress.c
Previous Message Tomas Vondra 2023-02-18 20:05:10 Re: PATCH: Using BRIN indexes for sorted output