Re: [RFC] building postgres with meson - v12

From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, samay sharma <smilingsamay(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Subject: Re: [RFC] building postgres with meson - v12
Date: 2022-09-09 23:58:36
Message-ID: 20220909235836.lz3igxtkcjb5w7zb@awork3.anarazel.de

Hi,

On 2022-08-31 11:11:54 -0700, Andres Freund wrote:
> > If the above are addressed, I think this will be just about at the
> > point where the above patches can be committed.
>
> Woo!

There was a lot less progress over the last ~week than I had hoped. The reason
is that I was trying to figure out why ecpg test files occasionally fail to
compile when building on windows in CI, with msbuild.

I went down many layers of rabbitholes while investigating, wasting an absurd
amount of time.

The problem:

Occasionally ecpg test files would fail to compile, exiting with -1073741819:
C:\BuildTools\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(241,5): error MSB8066: Custom build for 'C:\cirrus\build\meson-private\custom_target.rule' exited with code -1073741819. [c:\cirrus\build\src\interfaces\ecpg\test\sql\3701597@@twophase.c@cus.vcxproj]

-1073741819 is 0xc0000005, which in turn is STATUS_ACCESS_VIOLATION, i.e. a
segfault. This happens in roughly 1/3 of the builds, but with "streaks" where
it doesn't happen at all and streaks where it happens more frequently.
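
For reference, the conversion from the signed exit code that msbuild prints to
the NTSTATUS value can be checked with a few lines of python. This is only an
illustrative sketch; the helper name to_ntstatus is made up for this example.

```python
# Windows exit codes are 32-bit; msbuild prints them as signed integers.
# Masking with 0xFFFFFFFF reinterprets the value as the unsigned NTSTATUS.

STATUS_ACCESS_VIOLATION = 0xC0000005


def to_ntstatus(exit_code):
    """Reinterpret a (possibly negative) 32-bit exit code as unsigned.

    Hypothetical helper name, used only for this illustration.
    """
    return exit_code & 0xFFFFFFFF


code = -1073741819
print(hex(to_ntstatus(code)))
print(to_ntstatus(code) == STATUS_ACCESS_VIOLATION)
```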

However, despite our CI images having a JIT debugger configured (~coredump
handler), no crash report was triggered. The problem never occurs in my
windows VM.

At first I thought that might be because it's an assertion failure or such,
which only causes a dump when a bunch of magic is done (see main.c). But
despite adding all the necessary magic to ecpg.exe, no dump.

Unfortunately, adding debug output reduces the frequency of the issue
substantially.

Eventually I figured out that it's not actually ecpg.exe that is crashing. It
is meson's python wrapper around built binaries as part of the build (for
setting PATH, working directory, without running into cmd.exe issues). A
modified meson wrapper showed that ecpg.exe completes successfully.

The only thing the meson wrapper does after running the command is to call
sys.exit(returncode), and I had printed out the returncode, which is 0.

I looked through a lot of the python code, to see why no crashdump and no
details are forthcoming. There weren't any relevant
SetErrorMode(SEM_NOGPFAULTERRORBOX) calls. I tried to set PYTHONFAULTHANDLER,
but still no stack trace.

Next I suspected that cmd.exe might be crashing and causing the
problem. Modified meson to add 'echo %ERRORLEVEL%' to the msbuild
custombuild. Which indeed shows the STATUS_ACCESS_VIOLATION returncode after
running python. So it's not cmd.exe.

The problem even persisted when replacing meson's sys.exit() with os._exit(),
which indeed just calls _exit().

I tried to reproduce the problem using a python with debugging enabled. The
problem doesn't occur despite quite a few runs.

I found scattered other reports of this problem happening on windows. Went
down a few more rabbitholes. Too boring to repeat here.

At this point I finally figured out that the reason the crash reports don't
happen is that everything started by cirrus-ci on windows has an errormode of
SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX | SEM_NOOPENFILEERRORBOX.
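
A rough sketch of how the inherited error mode can be inspected from python,
in case somebody wants to check their own CI environment. The flag values are
the ones from the Windows SDK headers; the function names are made up for this
example, and this is not the exact code I used.

```python
import ctypes
import sys

# Error mode flags as defined in the Windows SDK (winbase.h).
SEM_FLAGS = {
    0x0001: "SEM_FAILCRITICALERRORS",
    0x0002: "SEM_NOGPFAULTERRORBOX",
    0x0004: "SEM_NOALIGNMENTFAULTEXCEPT",
    0x8000: "SEM_NOOPENFILEERRORBOX",
}


def decode_error_mode(mode):
    """Return the names of the SEM_* flags set in mode (hypothetical helper)."""
    return [name for bit, name in SEM_FLAGS.items() if mode & bit]


def current_error_mode():
    """Return this process's error mode, or None when not on windows."""
    if sys.platform != "win32":
        return None
    # GetErrorMode() is a kernel32 function that takes no arguments.
    return ctypes.windll.kernel32.GetErrorMode()


mode = current_error_mode()
if mode is not None:
    print(hex(mode), decode_error_mode(mode))
```

The combination described above corresponds to 0x8003.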

A good bit later I figured out that while cirrus-ci isn't intentionally
setting that, golang does so *unconditionally* on windows:
https://github.com/golang/go/blob/54182ff54a687272dd7632c3a963e036ce03cb7c/src/runtime/signal_windows.go#L14
https://github.com/golang/go/blob/54182ff54a687272dd7632c3a963e036ce03cb7c/src/runtime/os_windows.go#L553
Argh. I should have checked what the error mode is earlier, but this is just
very sneaky.

So I modified meson to change the errormode and tried to reproduce the issue
again, to finally get a stackdump. And tried again. And again. Without a
single relevant failure (I saw tests fail in ways that are discussed on the
list, but that's irrelevant here).

I've run this through enough attempts by now that I'm quite confident that the
problem does not occur when the errormode does not include
SEM_NOOPENFILEERRORBOX. I'll want a few more runs to be certain, but...

Given that the problem appears to happen after _exit() is called, and only
when SEM_NOOPENFILEERRORBOX is set, it seems likely to be an OS / C
runtime bug. Presumably it's related to something that python does first, but
I don't see how anything could justify crashing only if SEM_NOOPENFILEERRORBOX
is set (rather than the opposite).

I have no idea how to debug this further, given that the problem is quite rare
(can't attach a debugger and wait), only happens when crashdumps are prevented
from happening (so no idea where it crashes) and is made less common by debug
printfs.

So for now the best way forward I can see is to change the error mode for CI
runs. Which is likely a good idea anyway, so we can see crashdumps for
binaries other than postgres.exe (which does SetErrorMode() internally). I
managed to do so by setting CIRRUS_SHELL to a python wrapper around cmd.exe
that does SetErrorMode(). I'm sure there are easier ways, but I couldn't figure
out any.
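
The general shape of such a wrapper is roughly the following. Treat it as a
minimal sketch rather than the exact script: the function names and the choice
of 0 (the OS default) as the new mode are assumptions made for illustration.
It relies on child processes inheriting the error mode of their parent by
default.

```python
import subprocess
import sys


def set_error_mode(mode=0):
    """Set the process error mode; returns the previous mode.

    mode=0 (the OS default) is an assumption for this sketch.
    Does nothing and returns None when not running on windows.
    """
    if sys.platform != "win32":
        return None
    import ctypes
    return ctypes.windll.kernel32.SetErrorMode(mode)


def shell_command(args):
    """Build the cmd.exe invocation, passing all arguments through."""
    return ["cmd.exe"] + list(args)


def main(argv):
    # Change the mode first, so cmd.exe and everything it starts inherit it.
    set_error_mode()
    return subprocess.run(shell_command(argv)).returncode


# Only meaningful on windows, where cmd.exe exists.
if __name__ == "__main__" and sys.platform == "win32":
    sys.exit(main(sys.argv[1:]))
```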

I'd like to reclaim my time. But I'm afraid nobody will be listening to that
plea...

Greetings,

Andres Freund
