Re: Anti-critical-section assertion failure in mcxt.c reached by walsender

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: Anti-critical-section assertion failure in mcxt.c reached by walsender
Date: 2021-05-08 03:30:44
Message-ID: 20210508033044.GA3082635@rfd.leadboat.com
Lists: pgsql-hackers

On Fri, May 07, 2021 at 10:18:14PM -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2021-05-07 17:14:18 -0700, Noah Misch wrote:
> >> Having a flaky buildfarm member is bad news. I'll LD_PRELOAD the attached to
> >> prevent fsync from reaching the kernel. Hopefully, that will make the
> >> hardware-or-kernel trouble unreachable. (Changing 008_fsm_truncation.pl
> >> wouldn't avoid this, because fsync=off doesn't affect syncs outside the
> >> backend.)
>
> > Not sure how reliable that is - there's other paths that could return an
> > error, I think.

Yep, one can imagine a failure at close() or something. All the non-HEAD
buildfarm failures are at some *sync call, so I'm optimistic about getting
mileage from this. (I didn't check the more-numerous HEAD failures.) If it's
not enough, I may move the farm directory to tmpfs.
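
For readers without the attachment, here is a minimal sketch of what an
fsync-suppressing LD_PRELOAD shim might look like. It is not the file attached
upthread; the file name nosync.c and the choice to cover only fsync() and
fdatasync() are assumptions made for illustration.

```c
/*
 * nosync.c: hypothetical LD_PRELOAD shim (an illustration, not the file
 * attached to this thread).
 *
 * Build:  cc -shared -fPIC -o nosync.so nosync.c
 * Use:    LD_PRELOAD=/path/to/nosync.so <command>
 */
#include <unistd.h>

/*
 * The dynamic linker resolves fsync() to this definition before libc's,
 * so the request never reaches the kernel.  Always report success.
 */
int
fsync(int fd)
{
	(void) fd;					/* unused */
	return 0;
}

/* Same treatment for fdatasync(). */
int
fdatasync(int fd)
{
	(void) fd;					/* unused */
	return 0;
}
```

A shim like this only covers the functions it defines. Other sync-like entry
points (msync(), sync_file_range() and so on) would still reach the kernel
unless they were overridden the same way, and statically linked programs
would be unaffected.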

> > If the root cause is the disk responding weirdly to
> > write cache flushes, you could tell the kernel that the disk has no
> > write cache (e.g. echo write through > /sys/block/sda/queue/write_cache).
>
> I seriously doubt Noah has root on that machine.

If I can make the case for that setting being a good thing for the VM's users
generally, I probably can file a ticket and get it done.

> More to the point, the admin told me it's a VM (or LDOM, whatever that is)
> under a Solaris host, so there's no direct hardware access going on
> anyway. He didn't say in so many words, but I suspect the reason he's
> suspecting kernel bugs is that there's nothing going wrong so far as the
> host OS is concerned.
