Re: Non-reproducible AIO failure

From: Andres Freund <andres(at)anarazel(dot)de>
To: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Non-reproducible AIO failure
Date: 2025-06-12 16:23:13
Message-ID: mncqmsiizh6mbgkce5zjp3nbaytuan6djwdpze6lmb4hujbrne@dcdkfinpg2vg
Lists: pgsql-hackers

Hi,

On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
> On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> > On 12/06/2025 4:57 pm, Andres Freund wrote:
> > > The problem appears to be in that switch between "when submitted, by the IO
> > > worker" and "then again by the backend". It's not concurrent access in the
> > > sense of two processes writing to the same value, it's that when switching
> > > from the worker updating ->distilled_result to the issuer looking at that, the
> > > issuer didn't ensure that no outdated version of ->distilled_result could be
> > > used.
> > >
> > > Basically, the problem is that the worker would
> > >
> > > 1) set ->distilled_result
> > > 2) perform a write memory barrier
> > > 3) set ->state to COMPLETED_SHARED
> > >
> > > and then the issuer of the IO would:
> > >
> > > 4) check ->state is COMPLETED_SHARED
> > > 5) use ->distilled_result
> > >
> > > The problem is that there currently is no barrier between 4 & 5, which means
> > > an outdated ->distilled_result could be used.
> > >
> > >
> > > This also explains why the issue looked so weird - eventually, after fprintfs,
> > > after a core dump, etc, the updated ->distilled_result result would "arrive"
> > > in the issuing process, and suddenly look correct.
> >
> > Thank you very much for the explanation.
> > Everything seems so simple once it has been explained that it is hard to
> > believe that, before, such behavior seemed explainable only by "black magic"
> > or an "OS bug" :)
>
> Indeed. For a while I was really starting to doubt my sanity - not helped by
> the fact that I'd lose most of the mental context between reproductions (the
> problem did not re-occur from Saturday to Wednesday morning...). What finally
> made it clearer was moving the failing assertion earlier (into
> pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
> that took out a lot of the time in which the problem could occur.

>
> > Certainly using an outdated result can explain such behavior.
> > But in which particular place do we lose the read barrier between 4 and 5?
> > I see `pgaio_io_wait`, which I expect to be called by the backend to wait
> > for completion of the IO.
> > And it calls `pgaio_io_was_recycled` to get the state, which in turn
> > enforces a read barrier:
> > ```
> >
> > bool
> > pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
> >                       PgAioHandleState *state)
> > {
> >     *state = ioh->state;
> >     pg_read_barrier();
> >
> >     return ioh->generation != ref_generation;
> > }
> > ```
>
> Yes, I don't think that path has the issue.
>
> As far as I can tell the problem only ever happens if we end up reclaiming an
> IO while waiting for a free IO handle.
>
> The known problematic path is
> pgaio_io_acquire()
> -> pgaio_io_wait_for_free()
> -> pgaio_io_reclaim()
>
> In that path we don't use pgaio_io_was_recycled(). I couldn't find any other
> path with the same issue [1].

> The issue only happening while waiting for free IO handles is presumably the
> reason why it happened in 027_stream_regress.pl: due to the small
> shared_buffers used for the test, io_max_concurrency is 1, which means we
> constantly need to wait for a free IO handle.
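
To spell out the ordering requirement in code form, here is a schematic sketch
of the two sides (using pg_write_barrier()/pg_read_barrier(); the field and
state names are written as I remember them, "result" is just a placeholder,
and the real code in aio.c does more than this, so treat it as illustration
rather than the actual implementation):

```c
/* completer side (e.g. the IO worker), simplified: */
ioh->distilled_result = result;             /* 1) publish the distilled result */
pg_write_barrier();                         /* 2) order the store above before... */
ioh->state = PGAIO_HS_COMPLETED_SHARED;     /* 3) ...advertising the IO as completed */

/* issuer side, simplified: */
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)    /* 4) observe the completion */
{
    /*
     * Without a read barrier here, the load of ->distilled_result below may
     * still return a stale value, even though the new ->state was already
     * observed - which is exactly the failure mode described above.
     */
    pg_read_barrier();
    result = ioh->distilled_result;             /* 5) now safe to use the result */
}
```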

Armed with that knowledge it turns out not to be too hard to create a repro
that fails a lot quicker than the multiple days it took before.

Starting postgres with

-c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c max_parallel_workers_per_gather=0

and creating a test table with:
create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b;

makes the following pgbench fail within seconds for me:
c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10

(interestingly at c=8 it takes closer to a minute on average)

Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
pgaio_io_wait_for_free() has so far prevented a crash. I'll leave it running
for a while longer, to make sure.
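
For illustration, roughly what that change looks like at one of the call sites
(the surrounding control flow in pgaio_io_wait_for_free() is paraphrased and
the state test is my shorthand; only the placement of the barrier is the point
here):

```c
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
{
    /*
     * Pair with the completer's write barrier: ensure that reads from the
     * handle inside pgaio_io_reclaim(), in particular of ->distilled_result,
     * cannot see values older than the ->state we just observed.
     */
    pg_read_barrier();
    pgaio_io_reclaim(ioh);
}
```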

We really lack any meaningful stress testing of IO paths, otherwise I suspect
we would have found this much earlier. I wonder if we could add a smaller
version of the above, with a small fixed runtime (maybe 3-5s), as a TAP test.

Greetings,

Andres Freund
