Re: Non-reproducible AIO failure

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Nico Williams <nico(at)cryptonector(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, rmt(at)lists(dot)postgresql(dot)org
Subject: Re: Non-reproducible AIO failure
Date: 2025-08-24 12:38:16
Message-ID: CA+hUKGK6ujMT5myrEkgQ+n-N3rquZA4haHfJszQVe4ofHd6z6A@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Aug 24, 2025 at 5:32 AM Konstantin Knizhnik <knizhnik(at)garret(dot)ru> wrote:
> On 20/08/2025 9:00 PM, Alexander Lakhin wrote:
> > for i in {1..10}; do np=$((20 + $RANDOM % 10)); echo "iteration $i:
> > $np"; time parallel -j40 --linebuffer --tag /tmp/repro-AIO-Assert.sh
> > {} ::: `seq $np` || break; sleep $(($RANDOM % 20)); done; echo -e "\007"
>
>
> Unfortunately I was not able to reproduce the problem on my MacBook M4
> Pro (which should be identical with yours) using your instructions: I
> tried five times but in all cases 10 iterations are completed normally.

I can't reproduce it either. I think the sequence is:

IO worker: pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED)
Backend: pgaio_io_wait(ioh, generation)
pgaio_io_was_recycled(ioh, generation, &state)
case for state == PGAIO_HS_COMPLETED_SHARED
pgaio_io_reclaim(boh)
pgaio_io_update_state(ioh, PGAIO_HS_IDLE)
ioh->op = PGAIO_OP_INVALID;
Backend: pgaio_io_acquire_nb()
Assert(ioh->state == PGAIO_HS_IDLE) *fails*

pgaio_io_update_state() is essentially:

pg_write_barrier();
ioh->state = new_state;

I looked at the code generated by a non-optimised build:

(lldb) dis --name pgaio_io_update_state
...
postgres[0x100687fc4] <+372>: dmb ish XXX memory barrier
postgres[0x100687fc8] <+376>: ldur w10, [x29, #-0xc]
postgres[0x100687fcc] <+380>: ldur x9, [x29, #-0x8]
postgres[0x100687fd0] <+384>: ldrh w8, [x9] XXX load
state + target
postgres[0x100687fd4] <+388>: ldrb w11, [x9, #0x2] XXX load op
postgres[0x100687fd8] <+392>: orr w8, w8, w11, lsl #16
postgres[0x100687fdc] <+396>: and w10, w10, #0xff
postgres[0x100687fe0] <+400>: and w8, w8, #0xffffff00
postgres[0x100687fe4] <+404>: orr w10, w8, w10
postgres[0x100687fe8] <+408>: lsr w8, w10, #16
postgres[0x100687fec] <+412>: strh w10, [x9] XXX store
state + target
postgres[0x100687ff0] <+416>: strb w8, [x9, #0x2] XXX store op
postgres[0x100687ff4] <+420>: ldp x29, x30, [sp, #0x60]
postgres[0x100687ff8] <+424>: add sp, sp, #0x70
postgres[0x100687ffc] <+428>: ret

Even the -O3 version does that, it just uses fancier instructions:

postgres[0x100687c08] <+316>: dmb ish
postgres[0x100687c0c] <+320>: ldurh w8, [x20, #0x1]
postgres[0x100687c10] <+324>: bfi w19, w8, #8, #16
postgres[0x100687c14] <+328>: lsr w8, w8, #8
postgres[0x100687c18] <+332>: strb w8, [x20, #0x2] XXX clobber op
postgres[0x100687c1c] <+336>: strh w19, [x20]

Could it be that the store buffer was flushed between the two stores,
pgaio_io_was_recycled() saw the new state, pgaio_io_reclaim() assigned
ioh->op = PGAIO_OP_INVALID and it was flushed to L1, and then finally
the time travelling op value was flushed and clobbered it?

Apple clang 17 output (= LLVM 19) with optimisation is smart enough to
use a single store:

postgres[0x1002f0ae8] <+308>: dmb ish
postgres[0x1002f0aec] <+312>: strb w19, [x20]

That's also how open source clang 17 compiles it if you rip out the
bitfield. I guess if you do that, you won't be able to reproduce
this, Alexander? Something like:

struct PgAioHandle
{
/* all state updates should go through pgaio_io_update_state() */
- PgAioHandleState state:8;
+ uint8 state;

/* what are we operating on */
- PgAioTargetID target:8;
+ uint8 target;

/* which IO operation */
- PgAioOp op:8;
+ uint8 op;

Curiously, Apple clang 17 without optimisation does some amount of
unnecessary clobbering, but not the whole bitset and it's only a
single 16 bit store.

postgres[0x10057bfb0] <+348>: dmb ish
postgres[0x10057bfb4] <+352>: ldur w10, [x29, #-0xc]
postgres[0x10057bfb8] <+356>: ldur x9, [x29, #-0x8]
postgres[0x10057bfbc] <+360>: ldrh w8, [x9]
postgres[0x10057bfc0] <+364>: and w10, w10, #0xff
postgres[0x10057bfc4] <+368>: and w8, w8, #0xffffff00
postgres[0x10057bfc8] <+372>: orr w8, w8, w10
postgres[0x10057bfcc] <+376>: strh w8, [x9]

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marc-Olaf Jaschke 2025-08-24 13:03:42 Re: Improve hash join's handling of tuples with null join keys
Previous Message Srinath Reddy Sadipiralla 2025-08-24 12:28:44 Re: Potential problem in commit f777d773878 and 4f7f7b03758