Re: Non-reproducible AIO failure

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Non-reproducible AIO failure
Date: 2025-06-07 04:00:00
Message-ID: 92b33ab2-0596-40fe-9db6-a6d821d08e8a@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Andres and Tom,

06.06.2025 22:37, Andres Freund wrote:
> On 2025-06-06 15:21:13 -0400, Tom Lane wrote:
>> It's a mistake to think that this is a compiler bug. The C standard
>> explicitly allows compilers to use word-wide operations to access
>> bit-field struct members. Such accesses may fetch or overwrite other
>> bit-fields in the same word, using a non-atomic read/modify/write
>> sequence. I don't have a recent C standard at hand, but I found this
>> quote on the web [1]:
>>
>> The C Standard, 3.17, paragraph 3 [ISO/IEC 9899:2024], states
>>
>> Note 2 to entry: A bit-field and an adjacent non-bit-field member
>> are in separate memory locations. The same applies to two
>> bit-fields, if one is declared inside a nested structure
>> declaration and the other is not, or if the two are separated by a
>> zero-length bit-field declaration, or if they are separated by a
>> non-bit-field member declaration ...
>>
>> So it's our code that is busted. No doubt, what is happening is
>> that process A is fetching two fields, modifying one of them,
>> and storing the word back (with the observed value of the other
>> field) concurrently with some other process trying to update
>> the other field. So the other process's update is lost.
> There shouldn't be any concurrent accesses here, so I don't really see how the
> above would explain the problem (the IO can only ever be modified by one
> backend, initially the "owning backend", then, when submitted, by the IO
> worker, and then again by the backend).

FWIW,  I tried the idea Matthias proposed upthread:
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -98,16 +98,17 @@ struct PgAioHandle
        /* all state updates should go through pgaio_io_update_state() */
        PgAioHandleState state:8;

+       /* bitfield of PgAioHandleFlags */
+       uint8           flags;
+
        /* what are we operating on */
        PgAioTargetID target:8;

+       uint8           num_callbacks;
+
        /* which IO operation */
        PgAioOp         op:8;

-       /* bitfield of PgAioHandleFlags */
-       uint8           flags;
-
-       uint8           num_callbacks;

        /* using the proper type here would use more space */
        uint8           callbacks[PGAIO_HANDLE_MAX_CALLBACKS];

and got 120 iterations passed cleanly. Without this change, I got failures
on iterations 41, 5, 20, Not a proof, just an observation...

May be the op field gets broken because of a (concurrent?) write to
not that field, by to state or target in some other place?

I also tried to reproduce the same on an ARM server again with clang 18,
using -O2 and -O3, and we with Konstantin didn't see such a sophisticated
assembly for pgaio_io_stage() — the code is much simpler and the failure
is not reproduced. The triplet there is:
PostgreSQL 18beta1 on aarch64-unknown-linux-gnu, compiled by Ubuntu clang version 18.1.3 (1ubuntu1), 64-bit

On that my MacBook, the triplet is:
PostgreSQL 18beta1 on aarch64-apple-darwin23.5.0, compiled by Apple clang version 16.0.0 (clang-1600.0.26.6), 64-bit
indri:
PostgreSQL 18beta1 on aarch64-apple-darwin24.5.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit
sifaka:
PostgreSQL 18devel on aarch64-apple-darwin24.4.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit

So it looks like "aarch64-apple" enables some extra optimizations with
bitfields...

By the way, there are also some other places with adjacent bitfields, e, g.:
typedef union PgAioTargetData
{
    struct
    {
        RelFileLocator rlocator;    /* physical relation identifier */
        BlockNumber blockNum;   /* blknum relative to begin of reln */
        BlockNumber nblocks;
        ForkNumber  forkNum:8;  /* don't waste 4 byte for four values */
        bool        is_temp:1;  /* proc can be inferred by owning AIO */
        bool        skip_fsync:1;
    }           smgr;

and
typedef struct PgStat_KindInfo
{
    /*
     * Do a fixed number of stats objects exist for this kind of stats (e.g.
     * bgwriter stats) or not (e.g. tables).
     */
    bool        fixed_amount:1;

    /*
     * Can stats of this kind be accessed from another database? Determines
     * whether a stats object gets included in stats snapshots.
     */
    bool        accessed_across_databases:1;

    /* Should stats be written to the on-disk stats file? */
    bool        write_to_file:1;
...

Best regards,
Alexander Lakhin

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nico Williams 2025-06-07 04:28:12 Re: Non-reproducible AIO failure
Previous Message David Rowley 2025-06-07 00:03:53 Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX