| From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
| Subject: | Re: Non-reproducible AIO failure |
| Date: | 2025-06-07 04:00:00 |
| Message-ID: | 92b33ab2-0596-40fe-9db6-a6d821d08e8a@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hello Andres and Tom,
06.06.2025 22:37, Andres Freund wrote:
> On 2025-06-06 15:21:13 -0400, Tom Lane wrote:
>> It's a mistake to think that this is a compiler bug. The C standard
>> explicitly allows compilers to use word-wide operations to access
>> bit-field struct members. Such accesses may fetch or overwrite other
>> bit-fields in the same word, using a non-atomic read/modify/write
>> sequence. I don't have a recent C standard at hand, but I found this
>> quote on the web [1]:
>>
>> The C Standard, 3.17, paragraph 3 [ISO/IEC 9899:2024], states
>>
>> Note 2 to entry: A bit-field and an adjacent non-bit-field member
>> are in separate memory locations. The same applies to two
>> bit-fields, if one is declared inside a nested structure
>> declaration and the other is not, or if the two are separated by a
>> zero-length bit-field declaration, or if they are separated by a
>> non-bit-field member declaration ...
>>
>> So it's our code that is busted. No doubt, what is happening is
>> that process A is fetching two fields, modifying one of them,
>> and storing the word back (with the observed value of the other
>> field) concurrently with some other process trying to update
>> the other field. So the other process's update is lost.
> There shouldn't be any concurrent accesses here, so I don't really see how the
> above would explain the problem (the IO can only ever be modified by one
> backend, initially the "owning backend", then, when submitted, by the IO
> worker, and then again by the backend).
FWIW, I tried the idea Matthias proposed upthread:
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -98,16 +98,17 @@ struct PgAioHandle
/* all state updates should go through pgaio_io_update_state() */
PgAioHandleState state:8;
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
/* what are we operating on */
PgAioTargetID target:8;
+ uint8 num_callbacks;
+
/* which IO operation */
PgAioOp op:8;
- /* bitfield of PgAioHandleFlags */
- uint8 flags;
-
- uint8 num_callbacks;
/* using the proper type here would use more space */
uint8 callbacks[PGAIO_HANDLE_MAX_CALLBACKS];
and got 120 iterations passed cleanly. Without this change, I got failures
on iterations 41, 5, 20, Not a proof, just an observation...
May be the op field gets broken because of a (concurrent?) write to
not that field, by to state or target in some other place?
I also tried to reproduce the same on an ARM server again with clang 18,
using -O2 and -O3, and we with Konstantin didn't see such a sophisticated
assembly for pgaio_io_stage() — the code is much simpler and the failure
is not reproduced. The triplet there is:
PostgreSQL 18beta1 on aarch64-unknown-linux-gnu, compiled by Ubuntu clang version 18.1.3 (1ubuntu1), 64-bit
On that my MacBook, the triplet is:
PostgreSQL 18beta1 on aarch64-apple-darwin23.5.0, compiled by Apple clang version 16.0.0 (clang-1600.0.26.6), 64-bit
indri:
PostgreSQL 18beta1 on aarch64-apple-darwin24.5.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit
sifaka:
PostgreSQL 18devel on aarch64-apple-darwin24.4.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit
So it looks like "aarch64-apple" enables some extra optimizations with
bitfields...
By the way, there are also some other places with adjacent bitfields, e, g.:
typedef union PgAioTargetData
{
struct
{
RelFileLocator rlocator; /* physical relation identifier */
BlockNumber blockNum; /* blknum relative to begin of reln */
BlockNumber nblocks;
ForkNumber forkNum:8; /* don't waste 4 byte for four values */
bool is_temp:1; /* proc can be inferred by owning AIO */
bool skip_fsync:1;
} smgr;
and
typedef struct PgStat_KindInfo
{
/*
* Do a fixed number of stats objects exist for this kind of stats (e.g.
* bgwriter stats) or not (e.g. tables).
*/
bool fixed_amount:1;
/*
* Can stats of this kind be accessed from another database? Determines
* whether a stats object gets included in stats snapshots.
*/
bool accessed_across_databases:1;
/* Should stats be written to the on-disk stats file? */
bool write_to_file:1;
...
Best regards,
Alexander Lakhin
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Nico Williams | 2025-06-07 04:28:12 | Re: Non-reproducible AIO failure |
| Previous Message | David Rowley | 2025-06-07 00:03:53 | Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX |