Quick Links

Re: Non-reproducible AIO failure

From:	Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Non-reproducible AIO failure
Date:	2025-06-17 14:54:12
Message-ID:	2938968e-8010-4d14-a211-baa2d5778e1b@garret.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 12/06/2025 4:57 pm, Andres Freund wrote:
> The problem appears to be in that switch between "when submitted, by the IO
> worker" and "then again by the backend". It's not concurrent access in the
> sense of two processes writing to the same value, it's that when switching
> from the worker updating ->distilled_result to the issuer looking at that, the
> issuer didn't ensure that no outdated version of ->distilled_result could be
> used.
>
> Basically, the problem is that the worker would
>
> 1) set ->distilled_result
> 2) perform a write memory barrier
> 3) set ->state to COMPLETED_SHARED
>
> and then the issuer of the IO would:
>
> 4) check ->state is COMPLETED_SHARED
> 5) use ->distilled_result
>
> The problem is that there currently is no barrier between 4 & 5, which means
> an outdated ->distilled_result could be used.
>
>
> This also explains why the issue looked so weird - eventually, after fprintfs,
> after a core dump, etc, the updated ->distilled_result result would "arrive"
> in the issuing process, and suddenly look correct.
>
Sorry, I realized that O do not completely understand how it can
explained assertion failure in `pgaio_io_before_start`:

Assert(ioh->op == PGAIO_OP_INVALID);

I am not so familiar with AIO implementation in Postgress, so please
excuse me if I missing something...

Ok, `distilled_result` contains some deteriorated data.
As far as I understand it is used for calling callbacks.

Right now in debugger (loaded core file), I see thsat two callbacks are
registered: `aio_md_readv_cb` and `aio_shared_buffer_readv_cb`.
None of them is changing "op" field.

This is full content of the handle:

(PgAioHandle) {
state = PGAIO_HS_HANDED_OUT
target = PGAIO_TID_SMGR
op = PGAIO_OP_READV
flags = '\x04'
num_callbacks = '\x02'
callbacks = {
    [0] = '\x02'
    [1] = '\x01'
    [2] = '\0'
    [3] = '\0'
}
callbacks_data = {
    [0] = '\0'
    [1] = '\0'
    [2] = '\0'
    [3] = '\0'
}
handle_data_len = '\x01'
owner_procno = 28
result = 0
node = {
    prev = 0x000000010c9ea368
    next = 0x000000010c9ea368
}
resowner = 0x000000014880ef28
resowner_node = {
    prev = 0x000000014880f1d8
    next = 0x000000014880f1d8
}
generation = 133159566
cv = {
    mutex = 0
    wakeup = (head = -1, tail = -1)
}
distilled_result = (id = 0, status = 0, error_data = 0, result = 1)
iovec_off = 448
report_return = 0x000000010f81b950
op_data = {
    read = (fd = 7, iov_length = 1, offset = 180936704)
    write = (fd = 7, iov_length = 1, offset = 180936704)
}
target_data = {
    smgr = {
      rlocator = (spcOid = 1663, dbOid = 5, relNumber = 16384)
      blockNum = 22088
      nblocks = 1
      forkNum = MAIN_FORKNUM
      is_temp = false
      skip_fsync = false
    }
}
}

Actually `PGAIO_OP_READV` is set in the only place: `pgaio_io_start_readv`.
And this function is obviously called in user's backed with procno=28
which letter gets this assert failure.

So I try to understand dataflow.

1. Backend calls `pgaio_io_start_readv` to initiate IO. It assigns
`op=PGAIO_OP_READV`
2. It is broadcasted to all CPUs because of write barrier in
`pgaio_io_update_state(ioh, PGAIO_HS_DEFINED)` called fro `pgaio_io_stage`.
3. AIO thread gets notification about this request and somehow handle it
changing status to `COMPLETED_SHARED` in `pgaio_io_process_completion`
which notifies backend using condvar (which is expected to enforce write
barrier).
4. Backend see that status is `COMPLETED_SHARED`. Backend can do it in
`pgaio_io_wait` or in `pgaio_io_wait_for_free`. In first case there
seems to be no problem, because `pgaio_io_was_recycled` is used which
does read barrier. So you suspect `pgaio_io_wait_for_free` where read
barrier was missed (before your patch). But it is not quite clear to me
what can happen with "op" field without read barrier? In theory backend
can see old value. In practice it seems to be impossible because as far
as I know the unit of syncing data between CPU is cache line.

So my hypothesis was the following: backend observed old value of "op"
(PGAIO_OP_READV, because this is what it has written itself), then
overwrites it with PGAIO_OP_INVALID in `pgaio_io_reclaim` and then
delayed update from AIO worker is received which restores old value -
PGAIO_OP_READV.
But it seems to be not so realistic scenario.

Can you explain please how according to you theory not up-to-date
version of distilled_result can explain such behavior?
It will be possible if result contains index of the handle and stale
version of the index can cause addressing from IO handle.
But it is not the case - as we can see from the trace, address of `ioh`
in the backens is always the same.

>
>> This is what I am observing myself: "op" field is modified and fetched by
>> the same process.
> Right - but I don't think the ->op field being wrong was actually part of the
> issue though.
>
>
>> Certainly process can be rescheduled to some other CPU. But if such
>> reschedule can cause loose of stored value, then nothing will work, will it?
> Yes, that'd be completely broken - but isn't the issue here.
>
> Greetings,
>
> Andres Freund

In response to

Re: Non-reproducible AIO failure at 2025-06-12 13:57:38 from Andres Freund

Responses

Re: Non-reproducible AIO failure at 2025-06-17 15:08:48 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Daniel Gustafsson	2025-06-17 14:59:31	Re: Avoid possible dereference null pointer (src/backend/utils/cache/relcache.c)
Previous Message	Andres Freund	2025-06-17 14:49:00	Re: Proposal: Adding compression of temporary files