Re: index prefetching

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	Tomas Vondra <tomas(at)vondra(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject:	Re: index prefetching
Date:	2025-08-14 21:06:07
Message-ID:	CAH2-WzkWNtCRTcUajGYrCkp9-+btteAthg21BzxbKV09AJuSrA@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Aug 14, 2025 at 4:44 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Interesting. In the sequential case I see some waits that are not attributed
> in explain, due to the waits happening within WaitIO(), not WaitReadBuffers().
> Which indicates that the read stream is trying to re-read a buffer that
> previously started being read.

I *knew* that something had to be up here. Thanks for your help with debugging!

> read_stream_start_pending_read()
> -> StartReadBuffers()
> -> AsyncReadBuffers()
> -> ReadBuffersCanStartIO()
> -> StartBufferIO()
> -> WaitIO()
>
> There are far fewer cases of this in the random case.

Index tuples with TIDs that are slightly out of order are very normal.
Even for *perfectly* sequential inserts, the FSM tends to use the last
piece of free space on a heap page some time after the heap page
initially becomes "almost full". I recently described this to Tomas on
this thread [1].

> From what I can tell the sequential case so often will re-read a buffer that
> it is already in the process of reading - and thus wait for that IO before
> continuing - that we don't actually keep enough IO in flight.

Oops.

There is an existing stop-gap mechanism in the patch that is supposed
to deal with this problem. index_scan_stream_read_next, which is the
read stream callback, has logic that is supposed to suppress duplicate
block requests. But that's obviously not totally effective, since it
only remembers the very last heap block request.

If this same mechanism remembered (say) the last 2 heap blocks it
requested, that might be enough to totally fix this particular
problem. This isn't a serious proposal, but it'll be simple enough to
implement. Hopefully when I do that (which I plan to soon) it'll fully
validate your theory.
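
To make the "remember the last 2 heap blocks" idea concrete, here is a minimal standalone sketch of such a duplicate filter. It is not code from the patch: RecentBlocks, RECENT_BLOCKS, recent_blocks_init and recent_blocks_should_request are hypothetical names, and BlockNumber/InvalidBlockNumber are local stand-ins for the PostgreSQL definitions so that the example compiles on its own. It only models the "should this block be requested again?" decision that a callback like index_scan_stream_read_next would make. It says nothing about how the scan side would keep the matching buffers available for the skipped requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins so the sketch is self-contained (assumptions). */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical: how many recent heap block requests to remember. */
#define RECENT_BLOCKS 2

/* Hypothetical state; not a struct from the patch. */
typedef struct RecentBlocks
{
	BlockNumber blocks[RECENT_BLOCKS];	/* most recently requested blocks */
	int			next;					/* slot to overwrite next */
} RecentBlocks;

void
recent_blocks_init(RecentBlocks *rb)
{
	for (int i = 0; i < RECENT_BLOCKS; i++)
		rb->blocks[i] = InvalidBlockNumber;
	rb->next = 0;
}

/*
 * Returns true if blkno should be passed on to the read stream, or false
 * if it duplicates one of the last RECENT_BLOCKS requests and can be
 * suppressed.  Only blocks that are actually requested are remembered.
 */
bool
recent_blocks_should_request(RecentBlocks *rb, BlockNumber blkno)
{
	assert(blkno != InvalidBlockNumber);

	for (int i = 0; i < RECENT_BLOCKS; i++)
	{
		if (rb->blocks[i] == blkno)
			return false;		/* recently requested: suppress */
	}

	/* Not seen recently: remember it, evicting the oldest entry. */
	rb->blocks[rb->next] = blkno;
	rb->next = (rb->next + 1) % RECENT_BLOCKS;
	return true;
}
```

With RECENT_BLOCKS set to 1 this degenerates to remembering only the very last request, so a TID sequence that alternates between two heap blocks (10, 11, 10, 11, ...) is never suppressed. With 2 entries, the third and fourth requests in that sequence are filtered out.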

> We can optimize that by deferring the StartBufferIO() if we're encountering a
> buffer that is undergoing IO, at the cost of some complexity. I'm not sure
> real-world queries will often encounter the pattern of the same block being
> read in by a read stream multiple times in close proximity sufficiently often
> to make that worth it.

We definitely need to be prepared for duplicate prefetch requests in
the context of index scans. I'm far from sure how sophisticated that
actually needs to be. Obviously the design choices in this area are
far from settled right now.

[1] DC1G2PKUO9CI(dot)3MK1L3YBZ2V3T(at)bowt(dot)ie
--
Peter Geoghegan

In response to

Re: index prefetching at 2025-08-14 20:44:14 from Andres Freund

Responses

Re: index prefetching at 2025-08-14 21:55:53 from Peter Geoghegan
