From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Konstantin Knizhnik <knizhnik(at)garret(dot)ru> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Non-reproducible AIO failure |
Date: | 2025-06-10 17:41:07 |
Message-ID: | pzudwiqm4lgre6syrdvoii3gsauq2nhgms4hrltmv6znwwkqfk@gvwwv424hqq2 |
Lists: | pgsql-hackers |
Hi,
On 2025-06-10 17:28:11 +0300, Konstantin Knizhnik wrote:
> On 09/06/2025 2:05 am, Thomas Munro wrote:
> > On Sat, Jun 7, 2025 at 6:47 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
> > > > There is a really essential difference in the code generated by clang 15 (working)
> > > > and 16 (not working).
> > > There are also code gen differences between upstream clang 17 and Apple's
> > > clang, which is based on llvm 17 as well (I've updated the toolchain, it
> > > repros with that as well).
> > Just for the record, Apple clang 17 (self-reported clobbered version)
> > is said to be based on LLVM 19[1]. For a long time it was off by one
> > but tada now it's apparently two. Might be relevant if people are
> > comparing generated code up that close....
You've got to be kidding me. Because the world otherwise would be too easy, I
guess.
> > . o O (I wonder if one could corroborate that by running "strings" on
> > upstream clang binaries (as compiled by MacPorts/whatever) for each
> > major version and finding new strings, ie strings that don't appear in
> > earlier major versions, and then seeing which ones are present in
> > Apple's clang binaries... What a silly problem.)
> >
> > [1] https://en.wikipedia.org/wiki/Xcode#Xcode_15.0_-_16.x_(since_visionOS_support)
>
>
> Some updates: I was able to reproduce the problem on my Mac with the old clang
> (15.0), but only with optimization disabled (CFLAGS=-O0).
> So it is very unlikely to be a compiler bug.
I was able to reproduce it with gcc, too.
> Why is it more reproducible in a debug build? Maybe because of timing.
Code-gen-wise, the biggest change I see is that there is more stack spilling
due to assertion-related code...
> Or maybe it is because without optimization the compiler does stupid things: it
> loads all three bitfields from memory into a register (one half word + one byte),
> then does some manipulations with this register and writes it back to
> memory. Can the register somehow be clobbered between the read and the write
> (for example by a signal handler)? Very unlikely...
> So I still do not have any good hypothesis.
>
> But with the bitfields replaced with uint8, the bug is no longer reproduced.
> Maybe we should just make this change (which seems like a good thing in any case)?
I've reproduced it without that bitfield, unfortunately :(.
Unfortunately, my current set of debugging output seems to have prevented the
issue from recurring. I need to pare it down to re-trigger it. But for me it
only reproduces relatively rarely, so paring down the debug output is a rather
slow process :(
This is really a peculiar issue. I've now run tens of thousands of non-macOS
iterations without triggering this or a related issue even once. The one piece
of good news is that the regression tests are currently remarkably stable; I
think in the past I could hardly have run that many iterations without
(independent) failures.
Greetings,
Andres Freund