Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-07-18 18:31:37
Message-ID: aa46af80-5219-47e6-a7d0-7628106965a6@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I was wondering why the "simple" approach performs so much worse than
the "complex" one on some of the data sets. The theory was that it's due
to using read_stream_reset(), which resets the prefetch distance, and so
we need to "ramp up" from scratch (distance=1) for every batch. Which
for the correlated data sets is very often.

So I decided to do some experiments, to see if this is really the case,
and maybe see if read_stream_reset() could fix this in some way.

First, I added an

elog(LOG, "distance %d", stream->distance);

at the beginning of read_stream_next_block() to see how the distance
changes during the scan. Consider a query returning 2M rows from the
"cyclic" table (the attached .sql creates/pupulates it):

-- selects 20% rows
SELECT * FROM cyclic WHERE a BETWEEN 0 AND 20000;

With the "complex" patch, the CDF of the distance looks like this:

+----------+-----+
| distance | pct |
+----------+-----+
| 0 | 0 |
| 25 | 0 |
| 50 | 0 |
| 75 | 0 |
| 100 | 0 |
| 125 | 0 |
| 150 | 0 |
| 175 | 0 |
| 200 | 0 |
| 225 | 0 |
| 250 | 0 |
| 275 | 99 |
| 300 | 99 |
+----------+-----+

That is, 99% of the distances is in the range [275, 300].

Note: This is much higher than the effective_io_concurrency value (16),
which may be surprising. But the ReadStream uses that to limit the
number of I/O requests, not as a limit of how far to look ahead. A lot
of the blocks are in the cache, so it looks far ahead.

But with the "simple" patch it looks like this:

+----------+-----+
| distance | pct |
+----------+-----+
| 0 | 0 |
| 25 | 99 |
| 50 | 99 |
| 75 | 99 |
| 100 | 99 |
| 125 | 99 |
| 150 | 99 |
| 175 | 99 |
| 200 | 99 |
| 225 | 99 |
| 250 | 99 |
| 275 | 100 |
| 300 | 100 |
+----------+-----+

So 99% of the distances is in [0, 25]. A more detailed view on the first
couple distances:

+----------+-----+
| distance | pct |
+----------+-----+
| 0 | 0 |
| 1 | 99 |
| 2 | 99 |
| 3 | 99 |
| 4 | 99 |
...

So 99% of distances is 1. Well, that's not very far, it effectively
means no prefetching (We still issue the fadvise, though, although a
comment in read_stream.c suggests we won't. Possible bug?).

This means *there's no ramp-up at all*. On the first leaf the distance
grows to ~270, but after the stream gets reset it stays at 1 and never
increases. That's ... not great?

I'm not entirely sure

I decided to hack the ReadStream a bit, so that it restores the last
non-zero distance seen (i.e. right before reaching end of the stream).
And with that I got this:

+----------+-----+
| distance | pct |
+----------+-----+
| 0 | 0 |
| 25 | 38 |
| 50 | 38 |
| 75 | 38 |
| 100 | 39 |
| 125 | 42 |
| 150 | 47 |
| 175 | 47 |
| 200 | 48 |
| 225 | 49 |
| 250 | 50 |
| 275 | 100 |
| 300 | 100 |
+----------+-----+

Not as good as the "complex" patch, but much better than the original.
And the performance got almost the same (for this one query).

Perhaps the ReadStream should do something like this? Of course, the
simple patch resets the stream very often, likely mcuh more often than
anything else in the code. But wouldn't it be beneficial for streams
reset because of a rescan? Possibly needs to be optional.

regards

--
Tomas Vondra

Attachment Content-Type Size
0001-read_stream-restore-prefetch-distance-after-reset.patch text/x-patch 2.0 KB
cyclic.sql application/sql 314 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2025-07-18 18:50:46 Re: index prefetching
Previous Message Jacob Champion 2025-07-18 18:11:15 Re: libpq: Process buffered SSL read bytes to support records >8kB on async API