Core dumps from recovery/017_shm

From:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Core dumps from recovery/017_shm
Date:	2025-10-13 01:07:39
Message-ID:	CA+hUKGKzfkN6re3yboQ+9qbhV3+f8Qk__ZCApSKY+NoC1Y1thA@mail.gmail.com
Lists:	pgsql-hackers

While looking for something else, I noticed that we occasionally see
assertion failures like this:

TRAP: failed Assert("latch->maybe_sleeping == false"), File:
"latch.c", Line: 378, PID: 28023

Here's one in the build farm:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-08-05%2005:52:51

And here are some recent cases on CI, which again fail somewhere else,
but that might be expected as these are cfbot branches from patches on
the mailing list:

You can drop those task IDs into these URLs:

https://cirrus-ci.com/task/$TASK_ID
https://api.cirrus-ci.com/v1/artifact/task/$TASK_ID/testrun/build/testrun/recovery/017_shm/log/017_shm_gnat.log

My current theory is that backends are exiting when the test kills the
postmaster, but a backend that is concurrently starting up takes over
its latch, and then its first ResetLatch(MyLatch) fails that assertion
because maybe_sleeping was never cleared. So I suppose it should be
cleared in ... DisownLatch()?
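
To make the suspected sequence concrete, here is a minimal single-process model of it. It is only a sketch: ModelLatch and the model_* functions are hypothetical stand-ins rather than the real definitions in latch.c, the PIDs are made up, and the memory barriers and shared-memory details are left out. The assumption, taken from the theory above, is that the old backend exits while maybe_sleeping is still set and that disowning the latch clears only owner_pid.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Simplified stand-in for a shared latch; not the real struct. */
typedef struct ModelLatch
{
    bool  is_set;
    bool  maybe_sleeping;
    pid_t owner_pid;
} ModelLatch;

/* Exiting backend gives up the latch: only owner_pid is cleared. */
void
model_disown_latch(ModelLatch *latch)
{
    latch->owner_pid = 0;
}

/* A newly started backend takes over the same latch. */
void
model_own_latch(ModelLatch *latch, pid_t pid)
{
    assert(latch->owner_pid == 0);  /* must not already be owned */
    latch->owner_pid = pid;
}

/* The condition that the failing Assert checks on reset. */
bool
model_reset_precondition_holds(const ModelLatch *latch)
{
    return latch->maybe_sleeping == false;
}

/* Replays the suspected race; returns true if the Assert would trip. */
bool
model_replay_race(void)
{
    ModelLatch latch = {false, false, 0};

    model_own_latch(&latch, 1001);  /* old backend (hypothetical PID) */
    latch.maybe_sleeping = true;    /* old backend is about to wait */
    model_disown_latch(&latch);     /* it exits; the flag stays set */

    model_own_latch(&latch, 1002);  /* new backend reuses the latch */
    return !model_reset_precondition_holds(&latch);
}
```

In this model model_replay_race() returns true, meaning the new owner's first reset would see the stale flag left behind by the previous owner.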

That sails close to the topic in these threads:

https://www.postgresql.org/message-id/flat/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
https://www.postgresql.org/message-id/flat/CA+hUKGKp0kTpummCPa97+WFJTm+uYzQ9Ex8UMdH8ZXkLwO0QgA%40mail.gmail.com

If we didn't use proc_exit(), we wouldn't recycle the latch, so the
problem would go away with the new emergency cleanup solution I'm
working on (which incidentally also gets rid of the other source of
core dump spam that clogs up BF and CI systems: archive scripts and
other subprocesses of backends). More about that soon on that last
thread, but...

That would still leave versions 15-18 with these rare assertion
failures, since they have commit c8f3bc24. So I think the thing to do
is change DisownLatch() to clear maybe_sleeping just where it also
clears owner_pid, and backpatch that. Another idea would be to do it
in WaitEventSetWaitBlock() before exiting, but that'd be duplicated in
several places.
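
A rough sketch of the first option, clearing the flag in the same place that ownership is given up. This is not a patch against latch.c: SketchLatch and the sketch_* functions are hypothetical simplifications, the PID arguments stand in for MyProcPid, and barriers are omitted. It only illustrates that a recycled latch then starts out with maybe_sleeping cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Simplified stand-in for a shared latch; not the real struct. */
typedef struct SketchLatch
{
    bool  is_set;
    bool  maybe_sleeping;
    pid_t owner_pid;
} SketchLatch;

/* Proposed shape: clear maybe_sleeping where owner_pid is cleared. */
void
sketch_disown_latch(SketchLatch *latch, pid_t my_pid)
{
    assert(latch->owner_pid == my_pid);
    latch->maybe_sleeping = false;  /* the proposed addition */
    latch->owner_pid = 0;
}

/* Reset with the same two checks the real function asserts. */
void
sketch_reset_latch(SketchLatch *latch, pid_t my_pid)
{
    assert(latch->owner_pid == my_pid);
    assert(latch->maybe_sleeping == false);
    latch->is_set = false;
}

/* Old backend exits mid-wait, new backend reuses the latch and resets it. */
bool
sketch_recycle_is_clean(void)
{
    SketchLatch latch = {true, false, 0};

    latch.owner_pid = 1001;             /* old backend (hypothetical PID) */
    latch.maybe_sleeping = true;        /* about to wait when it exits */
    sketch_disown_latch(&latch, 1001);

    latch.owner_pid = 1002;             /* new backend takes over */
    sketch_reset_latch(&latch, 1002);   /* no assertion failure now */
    return !latch.is_set && !latch.maybe_sleeping;
}
```

Doing it at disown time keeps the change in one function, whereas clearing the flag on each exit path of the wait code would have to be repeated in every implementation.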
