Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Date: 2023-02-20 15:00:00
Message-ID: 4245b0de-9367-c443-8145-7b62b0ab3c41@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

18.02.2023 23:09, Andres Freund wrote:
> Hi,
>
> On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote:
>> 18.02.2023 04:06, Andres Freund wrote:
>> Maybe it's just the backend started for the money test has got
>> the same PID (5948) that the backend for the name test had?
> I somehow mashed name and money into one test in my head... So forget what I
> wrote.
>
> That doesn't really explain the assertion though.
> ,,,
> It's definitely very aggressive in reusing pids - and it seems to
> intentionally do work to keep pids small. I wonder if it'd be worth trying to
> exercise this path aggressively by configuring a very low max pid on linux, in
> an EXEC_BACKEND build.
For information, I performed the simple analysis of
the postmaster.log after ordinary `vcregress check`:
grep -o -E 'client backend\[[0-9]+\] pg_regress/\w+'
..\..\..\src\test\regress\log\postmaster.log | sort | uniq | wc -l
grep -o -E 'client backend\[[0-9]+\] pg_regress'
..\..\..\src\test\regress\log\postmaster.log | sort | uniq | wc -l
and got the following numbers:
...
iteration 88
196
174
iteration 89
196
170
iteration 90
196
176
iteration 91
196
175

Then I performed some more experiments and came to the conclusion
that `vcregress check` is not very suitable for catching duplicate pids.
So I've wrote a simple TAP test and caught the condition sought:
t/099_check_pids.pl .. # 0
# 1
# 2
# 3
# 4
# 5
# 6
# Thread 15 failed:
# Got two equal pids in a row: 3552 on iteration 30
# 1

t/099_check_pids.pl .. 1/1 #   Failed test at t/099_check_pids.pl line 107.
...

server log contains:
...
2023-02-20 06:33:16.142 PST|[unknown]|[unknown]|5704|63f384ac.1648|LOG: 
connection received: host=127.0.0.1 port=55134
2023-02-20 06:33:16.142 PST|[unknown]|[unknown]|3552|63f384ac.de0|LOG: 
connection received: host=127.0.0.1 port=55138
2023-02-20 06:33:16.144 PST|postgres|postgres|5704|63f384ac.1648|LOG: 
connection authorized: user=postgres database=postgres
application_name=099_check_pids.pl
2023-02-20 06:33:16.144 PST|postgres|postgres|3552|63f384ac.de0|LOG: 
connection authorized: user=postgres database=postgres
application_name=099_check_pids.pl
2023-02-20 06:33:16.147 PST|postgres|postgres|5704|63f384ac.1648|LOG: 
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.147 PST|postgres|postgres|3552|63f384ac.de0|LOG: 
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.147 PST|postgres|postgres|5704|63f384ac.1648|LOG: 
disconnection: session time: 0:00:00.009 user=postgres database=postgres
host=127.0.0.1 port=55134
2023-02-20 06:33:16.147 PST|postgres|postgres|3552|63f384ac.de0|LOG: 
disconnection: session time: 0:00:00.008 user=postgres database=postgres
host=127.0.0.1 port=55138
2023-02-20 06:33:16.158 PST|[unknown]|[unknown]|1672|63f384ac.688|LOG: 
connection received: host=127.0.0.1 port=55139

...
2023-02-20 06:33:16.485 PST|postgres|postgres|2748|63f384ac.abc|LOG: 
connection authorized: user=postgres database=postgres
application_name=099_check_pids.pl
2023-02-20 06:33:16.486 PST|[unknown]|[unknown]|3552|63f384ac.de0|LOG: 
connection received: host=127.0.0.1 port=55164
2023-02-20 06:33:16.487 PST|postgres|postgres|3552|63f384ac.de0|LOG: 
connection authorized: user=postgres database=postgres
application_name=099_check_pids.pl
2023-02-20 06:33:16.488 PST|postgres|postgres|2748|63f384ac.abc|LOG: 
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.489 PST|postgres|postgres|2748|63f384ac.abc|LOG: 
disconnection: session time: 0:00:00.007 user=postgres database=postgres
host=127.0.0.1 port=55163
2023-02-20 06:33:16.490 PST|postgres|postgres|3552|63f384ac.de0|LOG: 
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.491 PST|postgres|postgres|3552|63f384ac.de0|LOG: 
disconnection: session time: 0:00:00.008 user=postgres database=postgres
host=127.0.0.1 port=55164
2023-02-20 06:33:16.503 PST|[unknown]|[unknown]|244|63f384ac.f4|LOG: 
connection received: host=127.0.0.1 port=55162
...
(note that even session IDs are the same)

Though I've got no that assert yet, thus maybe as you said, the pid
duplication
is not the (only) condition that triggered it (or maybe something is
wrong on my side).

Interestingly, but I can get the expected duplicates only if I start the
VS prompt
and run the test immediately after a logon (or VM reboot).
After some activity (or may be some time), the test can run for 30 minutes
without success (maybe it depends on OS cache...).

Also with a similar TAP test I discovered that on Linux (Debian 11
32-bit) pids
are generated sequentially:
# pid: 329
# pid: 331
# pid: 333
# pid: 336
# pid: 338
# pid: 340
# pid: 343
# pid: 345
# pid: 349
# pid: 353
So with an extra small max_pid (<360) I just get
"could not fork new process for connection: Resource temporarily
unavailable'"
before two processes, that started one after another, get the same pid.

But on Windows the sequence looks random:
# pid: 1736
# pid: 8168
# pid: 3764
# pid: 7180
# pid: 3724
# pid: 5372
# pid: 1588
# pid: 4188
# pid: 1404
# pid: 5280

Best regards,
Alexander

Attachment Content-Type Size
099_check_pids.pl application/x-perl 2.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-02-20 15:08:00 Re: Improving inferred query column names
Previous Message Peter Eisentraut 2023-02-20 14:55:45 Re: Move defaults toward ICU in 16?