From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Non-reproducible AIO failure |
Date: | 2025-06-17 13:35:55 |
Message-ID: | of6nnksyqlbqikhpiwspalskgtx5dax6te2dwn3ojmj5k7obh4@hrteef7hiwvp |
Lists: | pgsql-hackers |
Hi,
On 2025-06-16 20:22:00 -0400, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik(at)garret(dot)ru> writes:
> > On 16/06/2025 6:11 pm, Andres Freund wrote:
> >> I unfortunately can't repro this issue so far.
>
> > But unfortunately it means that the problem is not fixed.
>
> FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
> using MacPorts' current compiler release (clang version 19.1.7).
> The currently-proposed test case fails within a few minutes on
> e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5.
I'm surprised it takes that long, given it takes seconds to reproduce here
with the config parameters I outlined. Did you try cranking up the concurrency
a bit? Your machine has more cores than mine, and I found that that makes a
huge difference.
> However, I cannot repro that on a slightly older Mini M1 using Apple's
> current release (clang-1700.0.13.5, which per wikipedia is really LLVM
> 19.1.4). It seems to work fine even without e9a3615a5. So the whole
> thing is still depressingly phase-of-the-moon-dependent.
It's not entirely surprising that an M1 would have a harder time reproducing
the issue: more cores, larger caches and a larger out-of-order execution
window all make it more likely that the missing memory barriers have a
visible effect.
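To illustrate the general class of problem (a generic sketch only, not the PostgreSQL AIO code and not the change made in e9a3615a5): one thread publishes plain data and then sets a flag, another waits for the flag and then reads the data. All names below (`payload`, `ready`, `run_handoff`) are hypothetical. With the release/acquire pair the reader is guaranteed to see the published value; if both accesses to `ready` were `memory_order_relaxed`, a weakly ordered CPU would be allowed to let the reader observe the flag before the data, and a wider out-of-order window makes that easier to hit in practice.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical names for illustration; not taken from PostgreSQL. */
static int payload;        /* plain data published by the writer */
static atomic_int ready;   /* flag meaning "payload is valid" */

static void *
writer(void *arg)
{
    (void) arg;
    payload = 42;
    /* Release: the store to payload cannot be reordered after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *
reader(void *arg)
{
    int *result = arg;

    /* Acquire: the load of payload cannot be reordered before this load. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;   /* spin until the writer has published */
    *result = payload;
    return NULL;
}

/* Runs one writer/reader handoff and returns the value the reader saw. */
int
run_handoff(void)
{
    pthread_t w, r;
    int result = 0;
    int rc;

    payload = 0;
    atomic_store(&ready, 0);

    rc = pthread_create(&r, NULL, reader, &result);
    assert(rc == 0);
    rc = pthread_create(&w, NULL, writer, NULL);
    assert(rc == 0);
    (void) rc;

    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return result;
}
```

Calling `run_handoff()` in a loop returns the published value every time with the barriers in place. Without them, a failure would only show up occasionally and only on some hardware, which matches how timing-dependent this kind of bug tends to be.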
I'm reasonably sure that e9a3615a5 quashed that specific issue - I could repro
it within seconds with e9a3615a5^ and with e9a3615a5 I ran it for several days
without a single failure...
> I don't doubt that Konstantin has found a different issue, but
> it's hard to be sure about the fix unless we can get it to be
> more reproducible. Neither of my machines has ever shown the
> symptom he's getting.
I've not been able to reproduce that symptom a single time either so far.
The assertion continues to be inexplicable to me. It shows, within a single
process, memory in shared memory going "backwards". But not always, just very
occasionally. Because this is before the IO is defined, there's no concurrent
access whatsoever.
I stole^Wgot my partner's M1 MacBook for a bit to try to reproduce the issue
there. It has
"Apple clang version 16.0.0 (clang-1600.0.26.6)"
on
"Darwin Kernel Version 24.3.0"
That's the same Apple-clang version that Alexander reported being able to
reproduce the issue on [1], but unfortunately it's a newer kernel version. No
dice in the first 55 test iterations.
Konstantin, Alexander - are you using the same device to reproduce this or
different ones? I wonder if this somehow depends on some MDM / corporate
enforcement tooling running or such.
What does:
- profiles status -type enrollment
- kextstat -l
show?
Greetings,
Andres Freund
[1] https://postgr.es/m/92b33ab2-0596-40fe-9db6-a6d821d08e8a%40gmail.com