From: | Konstantin Knizhnik <knizhnik(at)garret(dot)ru> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Nico Williams <nico(at)cryptonector(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, rmt(at)lists(dot)postgresql(dot)org |
Subject: | Re: Non-reproducible AIO failure |
Date: | 2025-08-28 12:57:27 |
Message-ID: | b133ca5f-f184-4a6c-b3e6-da4847cb47b2@garret.ru |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 28/08/2025 3:08 AM, Thomas Munro wrote:
> On Thu, Aug 28, 2025 at 11:08 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>> On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote:
>>> Still it is not quite clear to me how bitfields can cause this issue.
>> Same.
> Here's what I speculated after reading the generated asm[1]:
>
> "Could it be that the store buffer was flushed between the two stores,
> pgaio_io_was_recycled() saw the new state, pgaio_io_reclaim() assigned
> ioh->op = PGAIO_OP_INVALID and it was flushed to L1, and then finally
> the time travelling op value was flushed and clobbered it?"
>
> To put it in terms of cache line modes and store buffer operations,
> the IO worker does this, omitting uninteresting instructions:
>
> postgres[0x100687fd0] <+384>: ldrh w8, [x9] ; load state, target
> postgres[0x100687fd4] <+388>: ldrb w11, [x9, #0x2] ; load op
> ... build new state + target in w10 ....
> postgres[0x100687fec] <+412>: strh w10, [x9] ; state <- COMPLETED_SHARED
> postgres[0x100687ff0] <+416>: strb w8, [x9, #0x2] ; op <- PGAIO_OP_READV
>
> My speculation was that the two stores hit memory in separate store
> buffer flushes for whatever reason (something to do with a context
> switch, or maybe the store buffer is just full, or... who knows,
> doesn't matter). Cache line exclusive mode is released and
> re-acquired in between the two flushes. I know that merely executing
> a store instruction doesn't acquire cache line exclusive mode, I'm
> talking specifically about store buffer flush operations here, which
> must.
>
> In that window, pgaio_io_wait() in the owner backend sees state ==
> COMPLETED_SHARED and runs pgaio_io_reclaim() which then stores and
> flushes op <- PGAIO_OP_IDLE. The IO worker core's second flush
> operation is delayed because it has to wait to re-acquire the cache
> line in exclusive mode, ie wait for the owner backend to release it,
> and then it clobbers op with the old value, later producing this
> assertion failure in the owner backend:
>
> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c",
> Line: 167, PID: 56420
>
> To put it another way, this could only work correctly if the IO worker
> did: store op (unnecessarily because bitfield blah blah), dmb ish,
> store state, pairing with the owner's dmb ishld, load state. But we
> have no control over that, the fact that the compiler generated two
> stores for our byte-sized assignment is invisible from C.
I am convinced by your speculations:)
And it explains why replacing of bitfields with uint8 eliminates the
problem despite to the fact that it is not possible to update individual
byte at RISC and cache line. The critical thing here is that op and
state are updated independently.
If state field is byte or "good" version of clang (or enabling
optimization) eliminates extracting and storing redundant bitfields,
then the problem describe above is not possible. Even through `state`
and `op` can be updated together but it is atomic update!
No other process can see new value of `state`, do assignment to `op` and
some time later receive delayed update for `op`, which overwrites new value.
If it is true, then patch proposed by Andres should solve the issue (and
simplify generated code).
From | Date | Subject | |
---|---|---|---|
Next Message | BharatDB | 2025-08-28 13:01:55 | Reduce "Var IS [NOT] NULL" quals during constant folding |
Previous Message | Tomas Vondra | 2025-08-28 12:45:24 | Re: index prefetching |