From: | Konstantin Knizhnik <knizhnik(at)garret(dot)ru> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Non-reproducible AIO failure |
Date: | 2025-06-16 15:48:27 |
Message-ID: | d9bc9a1a-8d35-4a6a-b37e-6e53b2690525@garret.ru |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 16/06/2025 6:11 pm, Andres Freund wrote:
> Hi,
>
> On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote:
>> One more update: with the proposed patch (memory barrier before
>> `ConditionVariableBroadcast` in `pgaio_io_process_completion`
> I don't see how that barrier could be required for correctness -
> ConditionVariableBroadcast() is a barrier itself (as the comment above the
> call notes, too).
>
>
> On 2025-06-15 14:48:43 +0300, Konstantin Knizhnik wrote:
>> Also I think that replacing bitfields with `uint8` and may be even with
>> `int`, is good idea at least to avoids false sharing.
> I don't think there's false sharing here. And even if there were, the
> granularity at which false sharing occurs is a cache line size, so either 64
> or 128 byte.
>
>
> I unfortunately can't repro this issue so far. I don't think it's the same
> problem as my patch fixes, so I'll push my patch. How exactly did you
> reproduce the probelm?
>
> Greetings,
>
> Andres Freund
Sorry, I was not sure that spinlock (used in
`ConditionVariableBroadcast`) enforces memory barrier. Certainly in this
case adding memory barrir here is not needed.
Concerning false sharing - I suspected that compiler can extract
bitfield from the word loaded before read barrier. But it seems to be
not possible (if barrier is correctl recognized by compiler) and more
over - I have reproduced it with didsabled optimization, so unlikely
compiler tries to eliminate some reads here. And I agree with your
argument about cache line: even replacing uint8 with int will not
prevent it.
But unfortunately it means that the problem is not fixed. I just did the
same test as you proposed:
c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 10000 -P 10
with the following config changes:
io_max_concurrency=1
io_combine_limit=1
synchronize_seqscans=false
restart_after_crash=false
max_parallel_workers_per_gather=0
fsync=off
Postgres was build with the following options: --without-icu --enable-debug --enable-cassert CFLAGS=-O0
As I wrote - it takes about 10000 seconds to get this assertion failure.
I can try to do it once again.
Looks like the problem is better reproduced with disabled optimizations.
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2025-06-16 15:58:34 | Re: Amcheck verification of GiST and GIN |
Previous Message | Christoph Berg | 2025-06-16 15:43:54 | Re: CHECKPOINT unlogged data |