Core dumps from recovery/017_shm

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Core dumps from recovery/017_shm
Date: 2025-10-13 01:07:39
Message-ID: CA+hUKGKzfkN6re3yboQ+9qbhV3+f8Qk__ZCApSKY+NoC1Y1thA@mail.gmail.com
Lists: pgsql-hackers

While looking for something else, I noticed that we occasionally see
assertion failures like this:

TRAP: failed Assert("latch->maybe_sleeping == false"), File: "latch.c", Line: 378, PID: 28023

Here's one in the build farm:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-08-05%2005:52:51

And here are some recent cases on CI, which again fail somewhere else,
but that might be expected as these are cfbot branches from patches on
the mailing list:

     task_id      |            task_name
------------------+---------------------------------
 6347210574528512 | Linux - Debian Bookworm - Meson
 6420333948829696 | FreeBSD - Meson
 5616450825617408 | FreeBSD - Meson
 4515661445070848 | Linux - Debian Bookworm - Meson
 4945927242252288 | Linux - Debian Bookworm - Meson
 5133563223343104 | Linux - Debian Bookworm - Meson

You can drop those task IDs into these URLs:

https://cirrus-ci.com/task/$TASK_ID
https://api.cirrus-ci.com/v1/artifact/task/$TASK_ID/testrun/build/testrun/recovery/017_shm/log/017_shm_gnat.log

My current theory is that backends are exiting when the test kills the
postmaster, but a backend that is concurrently starting up takes over
its latch, and then its first ResetLatch(MyLatch) fails that assertion
because maybe_sleeping was never cleared. So I suppose it should be
cleared in ... DisownLatch()?

That sails close to the topic in these threads:

https://www.postgresql.org/message-id/flat/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
https://www.postgresql.org/message-id/flat/CA+hUKGKp0kTpummCPa97+WFJTm+uYzQ9Ex8UMdH8ZXkLwO0QgA%40mail.gmail.com

If we didn't use proc_exit(), we wouldn't recycle the latch, so the
problem would go away with the new emergency cleanup solution I'm
working on (which incidentally also gets rid of the other source of
core dump spam that clogs up BF and CI systems: archive scripts and
other subprocesses of backends). More about that soon on that last
thread, but...

That would still leave versions 15-18 with these rare assertion
failures, since they have commit c8f3bc24. So I think the thing to do
is to change DisownLatch() to clear maybe_sleeping at the same point
where it clears owner_pid, and backpatch that. Another idea would be to
do it in WaitEventSetWaitBlock() before exiting, but that'd be
duplicated in several places.
