From: Andres Freund <andres(at)anarazel(dot)de>
To: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Non-reproducible AIO failure
Date: 2025-06-12 15:52:31
Message-ID: t7j2bqm4cdxgix2o327rvymsr24utvp4gujorq2lrisvdgkuzb@jqdjbbwpyjdg
Lists: pgsql-hackers
Hi,
On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend". It's not concurrent access in the
> > sense of two processes writing to the same value, it's that when switching
> > from the worker updating ->distilled_result to the issuer looking at that, the
> > issuer didn't ensure that no outdated version of ->distilled_result could be
> > used.
> >
> > Basically, the problem is that the worker would
> >
> > 1) set ->distilled_result
> > 2) perform a write memory barrier
> > 3) set ->state to COMPLETED_SHARED
> >
> > and then the issuer of the IO would:
> >
> > 4) check ->state is COMPLETED_SHARED
> > 5) use ->distilled_result
> >
> > The problem is that there currently is no barrier between 4 & 5, which means
> > an outdated ->distilled_result could be used.
> >
> >
> > This also explains why the issue looked so weird - eventually, after fprintfs,
> > after a core dump, etc., the updated ->distilled_result would "arrive"
> > in the issuing process, and suddenly look correct.
>
> Thank you very much for the explanation.
> Everything seems so simple once explained that it is hard to believe that,
> beforehand, such behavior looked like it could only be caused by "black
> magic" or an "OS bug" :)
Indeed. For a while I was really starting to doubt my sanity - not helped by
the fact that I'd lose most of the mental context between reproductions (the
problem did not re-occur from Saturday to Wednesday morning...). What finally
made it clearer was moving the failing assertion earlier (into
pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
that took out a lot of the time window in which the problem could occur.
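To make the ordering in the quoted steps 1-5 concrete, here is a minimal
stand-alone sketch. It uses C11 fences in place of PostgreSQL's
pg_write_barrier()/pg_read_barrier(), and the struct and names are
illustrative only, not the real PgAioHandle:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the relevant parts of an IO handle. */
typedef struct DemoHandle
{
    int         distilled_result;   /* stands in for ->distilled_result */
    _Atomic int state;              /* stands in for ->state */
} DemoHandle;

#define DEMO_COMPLETED_SHARED 1

/* IO worker side: steps 1-3 */
static void
worker_complete(DemoHandle *ioh, int result)
{
    ioh->distilled_result = result;                 /* 1) set result */
    atomic_thread_fence(memory_order_release);      /* 2) write memory barrier */
    atomic_store_explicit(&ioh->state, DEMO_COMPLETED_SHARED,
                          memory_order_relaxed);    /* 3) set state */
}

/* Issuer side: steps 4-5 */
static bool
issuer_check(DemoHandle *ioh, int *result)
{
    if (atomic_load_explicit(&ioh->state,
                             memory_order_relaxed) != DEMO_COMPLETED_SHARED)
        return false;                               /* 4) check state */
    atomic_thread_fence(memory_order_acquire);      /* the barrier missing in the bug */
    *result = ioh->distilled_result;                /* 5) use result */
    return true;
}
```

Without the acquire fence in issuer_check(), step 5 is allowed to observe a
->distilled_result from before the worker's step 1 - exactly the stale result
described above.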
> Certainly, using an outdated result can explain such behavior.
> But in which particular place do we lack the read barrier between 4 and 5?
> I see `pgaio_io_wait`, which I expect is called by the backend to wait for
> completion of the IO.
> And it calls `pgaio_io_was_recycled` to get the state, which in turn
> enforces a read barrier:
> ```
> bool
> pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
>                       PgAioHandleState *state)
> {
>     *state = ioh->state;
>
>     pg_read_barrier();
>
>     return ioh->generation != ref_generation;
> }
> ```
Yes, I don't think that path has the issue.
As far as I can tell the problem only ever happens if we end up reclaiming an
IO while waiting for a free IO handle.
The known problematic path is
pgaio_io_acquire()
-> pgaio_io_wait_for_free()
-> pgaio_io_reclaim()
In that path we don't use pgaio_io_was_recycled(). I couldn't find any other
path with the same issue [1].
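For illustration, the ordering that path needs on the issuer side looks
roughly like this. This is only a sketch, not the actual pgaio_io_reclaim()
code: the header and enum spelling are assumed, and
consume_distilled_result() is just a placeholder for whatever the caller does
with the result.

```c
#include "postgres.h"
#include "port/atomics.h"
#include "storage/aio_internal.h"   /* assumed: internal AIO declarations */

/* placeholder for the caller's use of the completed IO; not a real function */
extern void consume_distilled_result(PgAioHandle *ioh);

static void
issuer_look_at_result(PgAioHandle *ioh)
{
    /* 4) check that the worker has published the completion */
    if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
    {
        /*
         * Pairs with the pg_write_barrier() the worker issues between
         * setting ->distilled_result and setting ->state; without this,
         * the consumer below could see a stale ->distilled_result.
         */
        pg_read_barrier();

        /* 5) only now is it safe to look at ioh->distilled_result */
        consume_distilled_result(ioh);
    }
}
```

That is the same state-then-barrier-then-result ordering that
pgaio_io_was_recycled() already provides on the path you quoted.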
The issue only happening while waiting for free IO handles is presumably the
reason why it happened in 027_stream_regress.pl: due to the small
shared_buffers used for the test, io_max_concurrency is 1, which means we
constantly need to wait for a free IO handle.
Greetings,
Andres Freund
[1] I think theoretically we are missing a memory barrier in the worker when
starting to process an IO, but the lwlock taken to get the IO to process
provides that barrier. We should probably add a comment about that.
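To spell that out, the worker side today relies on the lock rather than an
explicit barrier. A sketch with made-up names (io_worker_queue_lock,
dequeue_submitted_io(), worker_process_io() are placeholders, not the actual
worker code):

```c
#include "postgres.h"
#include "storage/lwlock.h"

/* placeholders, not the actual worker-side declarations */
extern LWLock *io_worker_queue_lock;
extern struct PgAioHandle *dequeue_submitted_io(void);
extern void worker_process_io(struct PgAioHandle *ioh);

static void
worker_get_one_io(void)
{
    struct PgAioHandle *ioh;

    /* LWLock acquire/release imply memory barriers ... */
    LWLockAcquire(io_worker_queue_lock, LW_EXCLUSIVE);
    ioh = dequeue_submitted_io();
    LWLockRelease(io_worker_queue_lock);

    /*
     * ... so by this point the issuer's earlier writes defining the IO are
     * visible to the worker even without an explicit pg_read_barrier().
     */
    worker_process_io(ioh);
}
```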