From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Core dumps from recovery/017_shm |
Date: | 2025-10-13 01:07:39 |
Message-ID: | CA+hUKGKzfkN6re3yboQ+9qbhV3+f8Qk__ZCApSKY+NoC1Y1thA@mail.gmail.com |
Lists: | pgsql-hackers |
While looking for something else, I noticed that we occasionally see
assertion failures like this:
TRAP: failed Assert("latch->maybe_sleeping == false"), File: "latch.c", Line: 378, PID: 28023
Here's one in the build farm:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-08-05%2005:52:51
And here are some recent cases on CI. These runs also fail somewhere else,
but that might be expected, as they are cfbot branches built from patches on
the mailing list:
     task_id      |            task_name
------------------+---------------------------------
 6347210574528512 | Linux - Debian Bookworm - Meson
 6420333948829696 | FreeBSD - Meson
 5616450825617408 | FreeBSD - Meson
 4515661445070848 | Linux - Debian Bookworm - Meson
 4945927242252288 | Linux - Debian Bookworm - Meson
 5133563223343104 | Linux - Debian Bookworm - Meson
You can drop those task IDs into these URLs:
https://cirrus-ci.com/task/$TASK_ID
https://api.cirrus-ci.com/v1/artifact/task/$TASK_ID/testrun/build/testrun/recovery/017_shm/log/017_shm_gnat.log
My current theory is that backends are exiting when the test kills the
postmaster, but a backend that is concurrently starting up takes over
its latch, and then its first ResetLatch(MyLatch) fails that assertion
because maybe_sleeping was never cleared. So I suppose it should be
cleared in ... DisownLatch()?
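The suspected interleaving can be replayed on a toy model. This is a minimal sketch, not PostgreSQL code: ToyLatch, the toy_* helpers and the PIDs are hypothetical stand-ins. Only the owner_pid and maybe_sleeping fields and the asserted condition come from the report above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a latch slot that is recycled between backends. */
typedef struct ToyLatch
{
    int  owner_pid;       /* 0 means "nobody owns this slot" */
    bool maybe_sleeping;  /* set by the owner just before it blocks in a wait */
} ToyLatch;

/* The condition the failing assertion checks when the owner resets its latch. */
bool
toy_reset_assert_holds(const ToyLatch *latch)
{
    return latch->maybe_sleeping == false;
}

/*
 * Replay the theory: the old owner exits in the middle of a wait, giving up
 * ownership without touching maybe_sleeping, and a new backend takes the slot.
 * Returns whether the new owner's first reset would pass the assertion.
 */
bool
toy_replay_takeover(void)
{
    ToyLatch latch = {0, false};

    latch.owner_pid = 1001;        /* old backend owns the latch */
    latch.maybe_sleeping = true;   /* it is about to sleep */
    latch.owner_pid = 0;           /* it exits: owner cleared, flag left set */

    assert(latch.owner_pid == 0);  /* slot looks free to the next backend */
    latch.owner_pid = 1002;        /* starting backend takes over the slot */

    return toy_reset_assert_holds(&latch);
}
```

In this model toy_replay_takeover() returns false, which corresponds to the new owner tripping the assertion on its first reset.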
That sails close to the topic in these threads:
https://www.postgresql.org/message-id/flat/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
https://www.postgresql.org/message-id/flat/CA+hUKGKp0kTpummCPa97+WFJTm+uYzQ9Ex8UMdH8ZXkLwO0QgA%40mail.gmail.com
If we didn't use proc_exit(), we wouldn't recycle the latch, so the
problem would go away with the new emergency cleanup solution I'm
working on (which incidentally also gets rid of the other source of
core dump spam that clogs up BF and CI systems: archive scripts and
other subprocesses of backends). More about that soon on that last
thread, but...
That would still leave versions 15-18 with these rare assertion
failures, since they have commit c8f3bc24. So I think the thing to do
is change DisownLatch() to clear maybe_sleeping just where it also
clears owner_pid, and backpatch that. Another idea would be to do it
in WaitEventSetWaitBlock() before exiting, but that'd be duplicated in
several places.
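The shape of the first option, sketched on a stand-in struct rather than the real latch.c code. DemoLatch and the demo_* functions are hypothetical names. The only thing taken from the proposal is that maybe_sleeping is cleared at the same point as owner_pid.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared latch slot. */
typedef struct DemoLatch
{
    int  owner_pid;
    bool maybe_sleeping;
} DemoLatch;

/* Give up ownership, leaving no stale state behind for the next owner. */
void
demo_disown_latch(DemoLatch *latch, int my_pid)
{
    assert(latch->owner_pid == my_pid);  /* only the owner may disown */
    latch->owner_pid = 0;
    latch->maybe_sleeping = false;       /* the proposed extra step */
}

/* Exit mid-wait, then hand the slot to a new owner; returns the flag it sees. */
bool
demo_flag_seen_by_next_owner(void)
{
    DemoLatch latch = {1001, true};  /* old owner was about to sleep */

    demo_disown_latch(&latch, 1001);
    latch.owner_pid = 1002;          /* new backend takes over the slot */

    return latch.maybe_sleeping;
}
```

With the flag cleared at disown time, the next owner starts from a clean slot regardless of which exit path the previous owner took, which is the duplication concern with the WaitEventSetWaitBlock() alternative.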