Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Date: 2023-02-18 15:00:00
Message-ID: bdb91b16-b6da-32a8-15fd-c024e11b453a@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,
18.02.2023 04:06, Andres Freund wrote:
> Hi,
>
> On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
>> I still have no theory for how this condition was reached despite a
>> lot of time thinking about it and searching for more clues. As far as
>> I can tell, the recent improvements to postmaster's signal and event
>> handling shouldn't be related: the state management and logic was
>> unchanged.
> Yea, it's all very odd.
>
> If you look at the log:
>
> 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name DETAIL: No valid identifier after ".".
> 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name STATEMENT: SELECT parse_ident('xxx.1020');
> ...
> TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line: 329, PID: 5948
> abort() has been called
> ...
> 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: server process (PID 5948) was terminated by exception 0xC0000354
> 2023-02-08 00:53:27.420 GMT postmaster[872] HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
> 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: terminating any other active server processes
> 2023-02-08 00:53:27.434 GMT postmaster[872] LOG: all server processes terminated; reinitializing
>
>
> and that it's indeed the money test that failed:
> money ... FAILED (test process exited with exit code 2) 7337 ms
>
> it's very hard to understand how this stack can come to be:
>
> 00000085`f03ffa40 00007ff6`fd89faa8 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
> 00000085`f03ffa80 00007ff6`fd6474dc postgres!ExceptionalCondition(
> char * conditionName = 0x00007ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED",
> char * fileName = 0x00007ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c",
> int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
> 00000085`f03ffac0 00007ff6`fd676eff postgres!MarkPostmasterChildActive(void)+0x7c [c:\cirrus\src\backend\storage\ipc\pmsignal.c @ 329]
> 00000085`f03ffb00 00007ff6`fd59aa3a postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @ 375]
> 00000085`f03ffb60 00007ff6`fd467689 postgres!SubPostmasterMain(
> int argc = 0n3,
> char ** argv = 0x000001c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962]
> 00000085`f03ffd90 00007ff6`fda0e1c9 postgres!main(
> int argc = 0n3,
> char ** argv = 0x000001c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192]
>
> How can a process that we did notify crashing, that has already executed SQL
> statements, end up in MarkPostmasterChildActive()?
Maybe it's just the backend started for the money test has got
the same PID (5948) that the backend for the name test had?
A simple script that I've found [1] shows that the pids reused rather often
(for me, approximately each 300 process starts in Windows 10 H2), buy maybe
under some circumstances (many concurrent processes?) PIDs can coincide even
so often to trigger that behavior.

[1] https://superuser.com/questions/636497/does-windows-7-reuse-process-ids

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2023-02-18 15:54:57 Re: PATCH: Using BRIN indexes for sorted output
Previous Message Justin Pryzby 2023-02-18 14:09:31 Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED