Re: Need help debugging SIGBUS crashes

From: "Peter 'PMc' Much" <pmc(at)citylink(dot)dinoex(dot)sub(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Need help debugging SIGBUS crashes
Date: 2026-04-01 00:03:26
Message-ID: acxgzmNqBCuRGCf6@disp.intra.daemon.contact
Lists: pgsql-hackers

On Tue, Mar 17, 2026 at 04:56:48PM -0400, Tom Lane wrote:
! "Peter 'PMc' Much" <pmc(at)citylink(dot)dinoex(dot)sub(dot)org> writes:
! > On Tue, Mar 17, 2026 at 10:12:07AM -0400, Tom Lane wrote:
! > ! Why it was okay in older FreeBSD and not so much in v14, who knows?
!
! > Maybe it wasn't. Here it appeared out of thin air in February, while
! > the system was upgraded from 13.5 to 14.3 in July'25, and did run
! > without problems for these eight months.
! > So this is not directly or solely related to FBSD R.14, and while it
! > happens more likely during massive memory use, this also is not
! > stringent. Neither did I find any other solid determining condition.
!
! Yeah, it seems likely that there is some additional triggering
! condition that we don't understand; otherwise there would be more
! people complaining than just you.

Dear hackers ;)

I have now analyzed three of the memory dumps from crashed servers;
that is, I walked through the actual code of malloc() and did all
the computations manually, in order to understand why and where a
SIGBUS would be triggered.

What I found is an area of memory where jemalloc stores a lookup tree,
about 4 or 8 MB in size. That area is zeroed and sparsely populated
with pointers to other memory locations that jemalloc uses.
But within this area there are one or two 4 kB pages containing data
that does not belong there. That data is slightly structured, but there
is no unique signature by which I could identify an owner - it is not
fully random, but quite random, and also very different between the
three crashes.
When a memory pointer is fetched from one of those pages, it can point
anywhere, and that explains why dereferencing such a pointer yields
either SIGSEGV or SIGBUS.
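
As a minimal sketch of the kind of check described above (this is not
the manual procedure actually used on the dumps): assuming the lookup
area has already been extracted from a dump as raw bytes, one can split
it into 4 kB pages and flag every page whose share of non-zero 8-byte
words is far above what a sparse pointer table would show. The function
name, the 25% threshold, the 64-bit word size and the synthetic test
data are all hypothetical and chosen only for illustration.

```python
import os

PAGE_SIZE = 4096  # 4 kB pages, as in the description above
WORD_SIZE = 8     # assumption: 64-bit pointers


def suspicious_pages(region: bytes, max_nonzero_ratio: float = 0.25) -> list[int]:
    """Return indexes of pages whose share of non-zero words exceeds the threshold."""
    flagged = []
    words_per_page = PAGE_SIZE // WORD_SIZE
    zero_word = b"\x00" * WORD_SIZE
    for start in range(0, len(region) - PAGE_SIZE + 1, PAGE_SIZE):
        page = region[start:start + PAGE_SIZE]
        nonzero = sum(
            1
            for off in range(0, PAGE_SIZE, WORD_SIZE)
            if page[off:off + WORD_SIZE] != zero_word
        )
        if nonzero / words_per_page > max_nonzero_ratio:
            flagged.append(start // PAGE_SIZE)
    return flagged


# Synthetic stand-in for a 4 MB lookup area (NOT data from a real dump):
# zeroed, a few plausible-looking pointers, and one page of foreign data.
area = bytearray(4 * 1024 * 1024)
for off in (0x1000, 0x20008, 0x300010):
    area[off:off + WORD_SIZE] = (0x0000_0008_0120_0000).to_bytes(WORD_SIZE, "little")
area[700 * PAGE_SIZE:701 * PAGE_SIZE] = os.urandom(PAGE_SIZE)

print(suspicious_pages(bytes(area)))  # flags only the foreign page, index 700
```

A density threshold is a crude heuristic: it finds pages that are
clearly not sparse, but it says nothing about who wrote them.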

There is also one other person who has seen exactly the same
backtraces (and attributed them to autovacuum, and filed a bug report
against FreeBSD) - this rules out a hardware fault on my machine.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=294039

What seriously worries me is this: when something can replace pages
in a table used internally by jemalloc, can it also replace pages in
memory which are vital to the database itself? In other words, can
this lead to silent data corruption?

In my samples I found about 0.1% of the memory corrupted, and I still
assume that memory exhaustion is involved as an additional factor.
Together, these might explain why the crashes are observed so
rarely.
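
A quick cross-check of that figure, under the assumption (not stated
explicitly above) that the 0.1% refers to the jemalloc lookup area
rather than to the total process memory: one foreign 4 kB page in a
4 MB area, or two in an 8 MB area, both come out at roughly 0.1%. The
function name is hypothetical.

```python
PAGE_SIZE = 4 * 1024  # 4 kB pages


def corrupted_share(area_mb: int, bad_pages: int) -> float:
    """Fraction of an area of area_mb megabytes taken up by bad_pages 4 kB pages."""
    total_pages = area_mb * 1024 * 1024 // PAGE_SIZE
    return bad_pages / total_pages


for area_mb, bad_pages in ((4, 1), (8, 2)):
    share = corrupted_share(area_mb, bad_pages)
    print(f"{bad_pages} page(s) in {area_mb} MB: {share:.2%}")
# 1 page(s) in 4 MB: 0.10%
# 2 page(s) in 8 MB: 0.10%
```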

For now it is confirmed that the crashes can happen on FreeBSD 14.3,
14.4 and 15.0, and with PG r14, r15 and r16.

Furthermore (as you can read in the mentioned bug report) our PG
maintainer Palle Girgensohn suggested that the errata notice
FreeBSD-EN-26:03.vm might be causing the issues. The installation
of that patch coincides well with the first appearance of the
crashes.

For now I have removed that patch from my kernel, and have been
hammering on the database, without another crash, for nearly two days
now - but that is still too short to say anything with certainty.

I am unsure about what to do next. In the worst case, quite a
number of professional installations might be silently at risk,
so maybe something should be done?
Of course, I could just as well decide that this patch removal
(hopefully) solves my issue, that I am now (hopefully) done with this,
and go back to sleep, as everybody else may just care for themselves...

I'd be thankful for any suggestions.

cheerio,
PMc
