Re: Non-reproducible AIO failure

From: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Non-reproducible AIO failure
Date: 2025-06-15 11:48:43
Message-ID: 174af978-1406-4342-8990-87f2f0804107@garret.ru
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 13/06/2025 11:20 pm, Andres Freund wrote:
> Attached is a patch that fixes the problem for me. Alexander, Konstantin,
> could you verify that it also fixes the problem for you?
>
> Given that it does address the problem for me, I'm inclined to push this
> fairly soon, the barrier is pretty obviously required.
>
Unfortunately I still able to reproduce assertion failure after 17450
seconds (`--enable-debug --enable-cassert CFLAGS=-O0`):

  * frame #0: 0x0000000187248704 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x000000018727fc28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x000000018718dae8 libsystem_c.dylib`abort + 180
    frame #3: 0x00000001011a0fbc
postgres`ExceptionalCondition(conditionName="ioh->op ==
PGAIO_OP_INVALID", fileName="aio_io.c", lineNumber=161) at assert.c:66:2
    frame #4: 0x0000000100ea9b40
postgres`pgaio_io_before_start(ioh=0x000000010a6e22f0) at aio_io.c:161:2
    frame #5: 0x0000000100ea9a00
postgres`pgaio_io_start_readv(ioh=0x000000010a6e22f0, fd=12, iovcnt=1,
offset=85565440) at aio_io.c:81:2
    frame #6: 0x0000000100ec8798
postgres`FileStartReadV(ioh=0x000000010a6e22f0, file=4, iovcnt=1,
offset=85565440, wait_event_info=167772181) at fd.c:2241:2
    frame #7: 0x0000000100f1b7d0
postgres`mdstartreadv(ioh=0x000000010a6e22f0, reln=0x000000010b0289c8,
forknum=MAIN_FORKNUM, blocknum=10445, buffers=0x000000016f5ed998,
nblocks=1) at md.c:1019:8
    frame #8: 0x0000000100f1ed88
postgres`smgrstartreadv(ioh=0x000000010a6e22f0, reln=0x000000010b0289c8,
forknum=MAIN_FORKNUM, blocknum=10445, buffers=0x000000016f5ed998,
nblocks=1) at smgr.c:758:2
    frame #9: 0x0000000100eb1a48
postgres`AsyncReadBuffers(operation=0x000000011d80d108,
nblocks_progress=0x000000016f5edeb4) at bufmgr.c:1959:3
    frame #10: 0x0000000100eb0c24
postgres`StartReadBuffersImpl(operation=0x000000011d80d108,
buffers=0x000000011d80cdf4, blockNum=10445, nblocks=0x000000016f5edeb4,
flags=0, allow_forwarding=true) at bufmgr.c:1428:18

I wonder if we also need second write barrier in pgaio_io_update_state:

    /*
     * Ensure the changes signified by the new state are visible before the
     * new state becomes visible.
     */
    pg_write_barrier();

    ioh->state = new_state;

    /*
     * Make sure that status is written
     */
    pg_write_barrier();

Otherwise is is not clear when status is written.
Also I think that replacing bitfields with `uint8` and may be even with
`int`, is good idea at least to avoids false sharing.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ryo Kanbayashi 2025-06-15 12:02:31 Re: [PATCH] PGSERVICEFILE as part of a normal connection string
Previous Message Alexander Korotkov 2025-06-15 11:02:44 Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly