Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: Anti-critical-section assertion failure in mcxt.c reached by walsender
Date: 2021-05-07 01:43:32
Message-ID: 1523.1620351812@sss.pgh.pa.us
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> While looking for something else, I noticed thorntail has failed twice
> like this, on REL_12_STABLE:
> TRAP: FailedAssertion("!(CritSectionCount == 0 ||
> (context)->allowInCritSection)", File:
> "/home/nm/farm/sparc64_deb10_gcc_64_ubsan/REL_12_STABLE/pgsql.build/../pgsql/src/backend/utils/mmgr/mcxt.c",
> Line: 931)

After failing to reproduce this locally, I went so far as to sign up
for a gcc compile farm account so I could try to reproduce it on the
machine running thorntail. I succeeded, after more than a few tries,
and here is the smoking gun:

#3  0x00000100007f792c in ExceptionalCondition (
    conditionName=0x10000a38b80 "!(CritSectionCount == 0 || (context)->allowInCritSection)", errorType=0x1000087fb20 "FailedAssertion",
    fileName=0x10000a38908 "mcxt.c", lineNumber=<optimized out>) at assert.c:54
#4  0x00000100008422f4 in palloc (size=64) at mcxt.c:931
#5  0x00000100001f5cec in XLogFileNameP (tli=<optimized out>, segno=1)
    at xlog.c:10209
#6  0x00000100001f6220 in issue_xlog_fsync (fd=<optimized out>, segno=1)
    at xlog.c:10186
#7  0x00000100001f6784 in XLogWrite (WriteRqst=..., flexible=<optimized out>)
    at xlog.c:2607
#8  0x00000100001f793c in XLogFlush (record=23717128) at xlog.c:2926
#9  XLogFlush (record=23717128) at xlog.c:2802
#10 0x00000100001fe71c in XLogReportParameters () at xlog.c:9525
#11 StartupXLOG () at xlog.c:7805
#12 0x0000010000552d30 in StartupProcessMain () at startup.c:226
#13 0x0000010000215c1c in AuxiliaryProcessMain (argc=2, argv=0x7feffdc2f80)
    at bootstrap.c:451

The interesting part of this is frame 6, which points here:

        case SYNC_METHOD_FDATASYNC:
            if (pg_fdatasync(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fdatasync file \"%s\": %m",
                                XLogFileNameP(ThisTimeLineID, segno))));

So fdatasync() failed, and the code attempting to report that is not
critical-section-safe because it includes a palloc. Checking the state
of elog.c's error stack shows that the failure was errno = 5, or EIO.
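
For illustration only, here is a minimal standalone sketch of one way such a report could avoid allocating: format the segment name into a stack buffer and save the error code before building the message. This is not PostgreSQL code. The names format_seg_name, report_sync_failure and SEG_NAME_LEN are hypothetical, and the three-field hex name layout with 256 segments per "id" is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical buffer size: room for 24 hex digits plus the terminator. */
#define SEG_NAME_LEN 64

/*
 * Hypothetical stand-in for a file-name formatter: writes into the
 * caller's buffer, so no heap allocation happens on the error path.
 */
static void
format_seg_name(char *buf, size_t buflen, unsigned tli,
                unsigned long long segno, unsigned long long segs_per_id)
{
    assert(buflen >= 25);       /* 24 hex digits + NUL */
    snprintf(buf, buflen, "%08X%08X%08X", tli,
             (unsigned) (segno / segs_per_id),
             (unsigned) (segno % segs_per_id));
}

/*
 * Hypothetical error-path helper: builds the message using only stack
 * storage.  The error code is copied first so that later calls cannot
 * clobber it.  Returns the snprintf() result.
 */
static int
report_sync_failure(char *msg, size_t msglen, unsigned tli,
                    unsigned long long segno, int err)
{
    char        segname[SEG_NAME_LEN];
    int         save_errno = err;

    /* 256 segments per id is an assumption for this sketch */
    format_seg_name(segname, sizeof(segname), tli, segno, 256);
    return snprintf(msg, msglen, "could not fdatasync file \"%s\": %s",
                    segname, strerror(save_errno));
}
```

A caller would pass its own message buffer, for example report_sync_failure(msg, sizeof(msg), 1, 1, EIO). The point of the sketch is only that nothing on the failure path calls an allocator.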

Conclusions:

1. No wonder we could not reproduce it anywhere else. I've warned
the cfarm admins that their machine may be having hardware issues.

2. We evidently need to put a bit more effort into this error
reporting logic. More generally, I wonder how we could audit
the code for similar hazards elsewhere, because I bet there are
some. (Or ... could it be sane to run functions included in
the ereport's arguments in ErrorContext?)

3. One might wonder why we're getting an fdatasync failure at
all, when thorntail is configured to run with fsync = off.
The answer to that one is that 008_fsm_truncation.pl takes it
upon itself to force fsync = on, overriding the express wishes
of the buildfarm owner, not to mention general project policy.
AFAICT that was added with little if any thought in the initial
creation of 008_fsm_truncation.pl, and I think we should take
it out. There's certainly no visible reason for this one
TAP script to be running with fsync on when no others do.

> Unfortunately there is no libbacktrace in that release, and for some
> reason we don't see a core being analysed... (gdb not installed,
> looking for wrong core file pattern, ...?)

That I'm not sure about. gdb is certainly installed, and thorntail is
visibly running the current buildfarm client and is configured with the
correct core_file_glob, and I can report that the crash did leave a 'core'
file in the data directory (so it's not a case of systemd commandeering
the core dump). Seems like core-file collection should've worked
... unless maybe it's not covering TAP tests at all?

regards, tom lane
