Re: Non-reproducible AIO failure

From: Andres Freund <andres(at)anarazel(dot)de>
To: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Non-reproducible AIO failure
Date: 2025-06-13 20:20:49
Message-ID: xtgg5qe65l7xz7undbewacoyezksxyjyu5yj2ve2bbasjgn5kr@hb5mqj7qjum5
Lists: pgsql-hackers

Hi,

On 2025-06-12 12:23:13 -0400, Andres Freund wrote:
> On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
> > On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> > > On 12/06/2025 4:57 pm, Andres Freund wrote:
> > > > The problem appears to be in that switch between "when submitted, by the IO
> > > > worker" and "then again by the backend". It's not concurrent access in the
> > > > sense of two processes writing to the same value; it's that when switching
> > > > from the worker updating ->distilled_result to the issuer looking at that, the
> > > > issuer didn't ensure that no outdated version of ->distilled_result could be
> > > > used.
> > > >
> > > > Basically, the problem is that the worker would
> > > >
> > > > 1) set ->distilled_result
> > > > 2) perform a write memory barrier
> > > > 3) set ->state to COMPLETED_SHARED
> > > >
> > > > and then the issuer of the IO would:
> > > >
> > > > 4) check ->state is COMPLETED_SHARED
> > > > 5) use ->distilled_result
> > > >
> > > > The problem is that there currently is no barrier between 4 & 5, which means
> > > > an outdated ->distilled_result could be used.
> > > >
> > > >
> > > > This also explains why the issue looked so weird - eventually, after fprintfs,
> > > > after a core dump, etc, the updated ->distilled_result would "arrive"
> > > > in the issuing process, and suddenly look correct.
> > >
> > > Thank you very much for the explanation.
> > > Everything seems so simple after the explanation that it is hard to believe
> > > that, before it, such behavior seemed like it could only be caused by
> > > "black magic" or an "OS bug" :)
> >
> > Indeed. For a while I was really starting to doubt my sanity - not helped by
> > the fact that I'd lose most of the mental context between reproductions (the
> > problem did not re-occur from Saturday to Wednesday morning...). What finally
> > made it clearer was moving the failing assertion earlier
> > (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
> > that took out a lot of the window in which the problem could occur.
>
> >
> > > Certainly using an outdated result can explain such behavior.
> > > But in which particular place do we lose the read barrier between 4 and 5?
> > > I see `pgaio_io_wait`, which I expect is called by the backend to wait for
> > > completion of the IO.
> > > And it calls `pgaio_io_was_recycled` to get the state, which in turn
> > > enforces a read barrier:
> > > ```
> > >
> > > bool
> > > pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
> > >                       PgAioHandleState *state)
> > > {
> > >     *state = ioh->state;
> > >     pg_read_barrier();
> > >
> > >     return ioh->generation != ref_generation;
> > > }
> > > ```
> >
> > Yes, I don't think that path has the issue.
> >
> > As far as I can tell the problem only ever happens if we end up reclaiming an
> > IO while waiting for a free IO handle.
> >
> > The known problematic path is
> > pgaio_io_acquire()
> > -> pgaio_io_wait_for_free()
> > -> pgaio_io_reclaim()
> >
> > In that path we don't use pgaio_io_was_recycled(). I couldn't find any other
> > path with the same issue [1].
>
> > The issue only happening while waiting for free IO handles is presumably the
> > reason why it happened in 027_stream_regress.pl: due to the small
> > shared_buffers used for the test, io_max_concurrency is 1, which means we
> > constantly need to wait for a free IO handle.
>
> Armed with that knowledge, it turns out not to be too hard to create a repro
> that fails a lot faster than the multiple days it previously took.
>
> Starting postgres with
>
> -c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c max_parallel_workers_per_gather=0
>
> and creating a test table with:
> create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b;
>
> makes the following pgbench fail within seconds for me:
> c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10
>
> (interestingly at c=8 it takes closer to a minute on average)
>
> Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
> pgaio_io_wait_for_free() has so far prevented a crash. I'll leave it running
> for a while longer, to make sure.
>
> We really lack any meaningful stress testing of IO paths; otherwise I suspect
> we would have found this much earlier. I wonder if we could add a smaller
> version of the above, with a small fixed runtime (maybe 3-5s), as a
> TAP test.
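
To make the ordering described above concrete, here is a minimal sketch (with
simplified stand-ins, not the actual aio.c data structures) of the
worker/issuer protocol from steps 1-5 and the read barrier that was missing on
the issuer side:

```
/* Illustrative sketch only -- simplified stand-ins, not the real PgAioHandle. */
#include "postgres.h"

#include "port/atomics.h"

typedef enum SketchState
{
    SKETCH_IN_FLIGHT,
    SKETCH_COMPLETED_SHARED
} SketchState;

typedef struct SketchHandle
{
    SketchState state;              /* stands in for ->state */
    int         distilled_result;   /* stands in for ->distilled_result */
} SketchHandle;

/* IO worker side: publish the result, then the state. */
static void
worker_complete(SketchHandle *h, int result)
{
    h->distilled_result = result;       /* 1) set ->distilled_result */
    pg_write_barrier();                 /* 2) write memory barrier */
    h->state = SKETCH_COMPLETED_SHARED; /* 3) set ->state to COMPLETED_SHARED */
}

/* Issuer side: observe the state, then consume the result. */
static bool
issuer_try_consume(SketchHandle *h, int *result)
{
    if (h->state != SKETCH_COMPLETED_SHARED)    /* 4) check ->state */
        return false;

    /*
     * This is the barrier that was missing in the
     * pgaio_io_wait_for_free() -> pgaio_io_reclaim() path: without it,
     * step 5 may see a stale ->distilled_result even though step 4 saw
     * the new ->state.
     */
    pg_read_barrier();

    *result = h->distilled_result;              /* 5) use ->distilled_result */
    return true;
}
```

The completer's write barrier only helps if every reader that observes the new
->state also issues a matching read barrier before looking at
->distilled_result; pgaio_io_was_recycled() already does that, the
wait-for-free path did not.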

Attached is a patch that fixes the problem for me. Alexander, Konstantin,
could you verify that it also fixes the problem for you?

Given that it does address the problem for me, I'm inclined to push this
fairly soon; the barrier is pretty obviously required.
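
For reference, the shape of the change (the attached patch is the
authoritative version; pgaio_io_reclaim() is local to aio.c, so this is only
an illustrative fragment of how the two call sites in pgaio_io_wait_for_free()
end up looking):

```
    /*
     * The completing process set ->distilled_result, issued
     * pg_write_barrier(), and then set ->state to COMPLETED_SHARED.
     * Pair that with a read barrier here so that, having observed the
     * new ->state, we cannot act on a stale ->distilled_result while
     * reclaiming the handle.
     */
    pg_read_barrier();
    pgaio_io_reclaim(ioh);
```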

Greetings,

Andres Freund

Attachment Content-Type Size
v1-0001-aio-Add-missing-memory-barrier-when-waiting-for-I.patch text/x-diff 5.7 KB
