From: Andres Freund <andres(at)anarazel(dot)de>
To: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Non-reproducible AIO failure
Date: 2025-06-12 15:52:31
Message-ID: t7j2bqm4cdxgix2o327rvymsr24utvp4gujorq2lrisvdgkuzb@jqdjbbwpyjdg
Lists: pgsql-hackers
Hi,
On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend". It's not concurrent access in the
> > sense of two processes writing to the same value, it's that when switching
> > from the worker updating ->distilled_result to the issuer looking at that, the
> > issuer didn't ensure that no outdated version of ->distilled_result could be
> > used.
> >
> > Basically, the problem is that the worker would
> >
> > 1) set ->distilled_result
> > 2) perform a write memory barrier
> > 3) set ->state to COMPLETED_SHARED
> >
> > and then the issuer of the IO would:
> >
> > 4) check ->state is COMPLETED_SHARED
> > 5) use ->distilled_result
> >
> > The problem is that there currently is no barrier between 4 & 5, which means
> > an outdated ->distilled_result could be used.
> >
> >
> > This also explains why the issue looked so weird - eventually, after fprintfs,
> > after a core dump, etc., the updated ->distilled_result would "arrive"
> > in the issuing process, and suddenly look correct.
>
> Thank you very much for the explanation.
> Everything seems so simple once explained that it is hard to believe that,
> beforehand, such behavior looked like it could only be caused by "black
> magic" or an "OS bug" :)
Indeed. For a while I was really starting to doubt my sanity - not helped by
the fact that I'd lose most of the mental context between reproductions (the
problem did not re-occur from Saturday to Wednesday morning...). What finally
made it clearer was moving the failing assertion earlier (into
pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
that took out a lot of the time window in which the problem could occur.
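To make the ordering in the quoted steps 1-5 concrete, here is a minimal
stand-alone sketch. It uses C11 fences in place of PostgreSQL's
pg_write_barrier()/pg_read_barrier(), and the struct and names are
illustrative only, not the real PgAioHandle:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the relevant parts of an IO handle. */
typedef struct DemoHandle
{
    int         distilled_result;   /* stands in for ->distilled_result */
    _Atomic int state;              /* stands in for ->state */
} DemoHandle;

#define DEMO_COMPLETED_SHARED 1

/* IO worker side: steps 1-3 */
static void
worker_complete(DemoHandle *ioh, int result)
{
    ioh->distilled_result = result;                 /* 1) set result */
    atomic_thread_fence(memory_order_release);      /* 2) write memory barrier */
    atomic_store_explicit(&ioh->state, DEMO_COMPLETED_SHARED,
                          memory_order_relaxed);    /* 3) set state */
}

/* Issuer side: steps 4-5 */
static bool
issuer_check(DemoHandle *ioh, int *result)
{
    if (atomic_load_explicit(&ioh->state,
                             memory_order_relaxed) != DEMO_COMPLETED_SHARED)
        return false;                               /* 4) check state */
    atomic_thread_fence(memory_order_acquire);      /* the barrier missing in the bug */
    *result = ioh->distilled_result;                /* 5) use result */
    return true;
}
```

Without the acquire fence in issuer_check(), step 5 is allowed to observe a
->distilled_result from before the worker's step 1 - exactly the stale result
described above.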
> Certainly, using an outdated result can explain such behavior.
> But in which particular place do we lack the read barrier between 4 and 5?
> I see `pgaio_io_wait`, which I expect is called by the backend to wait for
> completion of the IO.
> And it calls `pgaio_io_was_recycled` to get the state, which in turn
> enforces a read barrier:
> ```
> bool
> pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
>                       PgAioHandleState *state)
> {
>     *state = ioh->state;
>
>     pg_read_barrier();
>
>     return ioh->generation != ref_generation;
> }
> ```
Yes, I don't think that path has the issue.
As far as I can tell the problem only ever happens if we end up reclaiming an
IO while waiting for a free IO handle.
The known problematic path is
pgaio_io_acquire()
-> pgaio_io_wait_for_free()
-> pgaio_io_reclaim()
In that path we don't use pgaio_io_was_recycled(). I couldn't find any other
path with the same issue [1].
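For illustration, the ordering that path needs on the issuer side looks
roughly like this. This is only a sketch, not the actual pgaio_io_reclaim()
code: the header and enum spelling are assumed, and
consume_distilled_result() is just a placeholder for whatever the caller does
with the result.

```c
#include "postgres.h"
#include "port/atomics.h"
#include "storage/aio_internal.h"   /* assumed: internal AIO declarations */

/* placeholder for the caller's use of the completed IO; not a real function */
extern void consume_distilled_result(PgAioHandle *ioh);

static void
issuer_look_at_result(PgAioHandle *ioh)
{
    /* 4) check that the worker has published the completion */
    if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
    {
        /*
         * Pairs with the pg_write_barrier() the worker issues between
         * setting ->distilled_result and setting ->state; without this,
         * the consumer below could see a stale ->distilled_result.
         */
        pg_read_barrier();

        /* 5) only now is it safe to look at ioh->distilled_result */
        consume_distilled_result(ioh);
    }
}
```

That is the same state-then-barrier-then-result ordering that
pgaio_io_was_recycled() already provides on the path you quoted.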
The issue only happening while waiting for free IO handles is presumably the
reason why it happened in 027_stream_regress.pl: due to the small
shared_buffers used for the test, io_max_concurrency is 1, which means we
constantly need to wait for a free IO handle.
Greetings,
Andres Freund
[1] I think theoretically we are missing a memory barrier in the worker when
starting to process an IO, but the lwlock taken to get the IO to process
provides that barrier. We should probably add a comment about that.
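To spell that out, the worker side today relies on the lock rather than an
explicit barrier. A sketch with made-up names (io_worker_queue_lock,
dequeue_submitted_io(), worker_process_io() are placeholders, not the actual
worker code):

```c
#include "postgres.h"
#include "storage/lwlock.h"

/* placeholders, not the actual worker-side declarations */
extern LWLock *io_worker_queue_lock;
extern struct PgAioHandle *dequeue_submitted_io(void);
extern void worker_process_io(struct PgAioHandle *ioh);

static void
worker_get_one_io(void)
{
    struct PgAioHandle *ioh;

    /* LWLock acquire/release imply memory barriers ... */
    LWLockAcquire(io_worker_queue_lock, LW_EXCLUSIVE);
    ioh = dequeue_submitted_io();
    LWLockRelease(io_worker_queue_lock);

    /*
     * ... so by this point the issuer's earlier writes defining the IO are
     * visible to the worker even without an explicit pg_read_barrier().
     */
    worker_process_io(ioh);
}
```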