
From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Alexandre Felipe <o(dot)alexandre(dot)felipe(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2026-03-02 15:01:02
Message-ID: 3cbwjhwkomjv7jifau4yhb357gfnckut3sdrlbmhwzesd3kngj@affs2mpxg4gh
Lists: pgsql-hackers

Hi,

On 2026-03-02 14:18:31 +0100, Tomas Vondra wrote:
> On 3/2/26 10:00, Alexandre Felipe wrote:
> >
> >
> > On Sun, Mar 1, 2026 at 11:33 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >
> > On 3/1/26 23:32, Alexandre Felipe wrote:
> > >
> > > On Sun, Mar 1, 2026 at 3:03 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > >
> > >     Hi,
> > >
> > >     I've decided to run a couple tests, trying to reproduce some
> > of the
> > >     behaviors described in your (Felipe's) messages.
> > >
> > >
> > > Thank you,
> > > I will look into this data later. I am impressed with the number of IO
> > > workers
> > > you used, my test was typically with 3.
> > >
> >
> > 3 is extremely low for an I/O bound system. It's our tradition to pick
> > defaults that work even on tiny systems, but need tuning on actual
> > non-toy systems :-(
> >
> >
> > That was a surprise for me, because I am used to JavaScript,
> > which does everything in a single process (with a coroutine
> > async model) with very little overhead.
> >
>
> Well, we don't have coroutines (or other threads). If you want something
> similarly lightweight, you need to use io_uring. Which means you need to
> be on Linux, unfortunately.

Even if we had coroutines, there's no portable way of integrating them with
file IO. For that you need something like io_uring, IOCP or POSIX AIO
(although the latter performs so terribly on most platforms, including macOS,
that it's not worth using).

> > This could explain Andres Freund's observation [1]:
> >
> >> It seems caused to a significant degree by waiting at low queue
> >> depths. If I comment out the stream->distance-- in
> >> read_stream_start_pending_read() the regression is reduced greatly.
> >
>
> Yes, because the distance heuristic is based solely on buffer hits and
> misses (and not on any direct feedback about whether we're prefetching far
> enough).
>
> For any algorithm that increases/decreases the distance (e.g. the
> current distance*2 and distance--) there are data patterns that
> collapse it to 1 too early. It just takes the right fraction of hits and
> misses. (If we ignore "trivial" heuristics that never decrease the
> distance etc.)

I think this makes the problem sound harder than it needs to be for a
significant improvement:

Right now an IO pattern like [(miss, hit)+] will oscillate between distance =
0 and 1, and thus will do all IO synchronously. I.e. every second required
buffer will need a synchronous read - you'll get terrible IO performance.

This is also true for any pattern that has more hits, but not more misses in
a row - and those are extremely common patterns.

If you instead make the distance increase logic something like:

    distance = Max(4, distance * 2);

and have the decrease logic be something like:

    /*
     * Only decrease the distance if there are no IOs in the queue. As long
     * as we are asynchronously executing IO we are benefiting from the
     * higher distance.
     */
    if (stream->ios_in_progress == 0 && stream->distance > 1)
        stream->distance--;

You need a more adverse pattern to not get any readahead. Even if you look at
the most trivial adverse pattern ([(miss, hits{3})+]), you've reduced the
number of synchronous waits by 2x compared to the current worst case pattern.

And if you ever have two misses separated by fewer than three hits in a row,
the distance will much more often not be decreased by the hits (because
there's still IO in progress when encountering a hit) and will increase
further (because we need to wait for IO).

Another thing we probably ought to do is to perform lookahead when the fast
path encounters a miss (rather than doing so *after* waiting for the IO), so
that we at least amortize the cost of having to wait across multiple IOs.

> IMHO the simplest solution with the current heuristics would be to not
> allow the distance to "drop" below a "small" value that is high enough
> to hide the AIO overhead.

Unfortunately that has too big a performance penalty for fully cached
workloads :(. Doing buffer mapping lookups ahead of the current point is not
free.

We could probably do something much more aggressive than what I described
above without meaningful negative impacts, substantially reducing the maximum
negative impact of decreasing the distance. E.g. a separate cool-off counter
that needs to be decremented before the distance can be decreased after a
miss. If we e.g. only allowed decreasing the distance after 32 buffer hits,
at a time when no IO is in progress, the worst-case number of waits would be
a lot lower than today.

Greetings,

Andres Freund
