| From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
| Subject: | Re: index prefetching |
| Date: | 2026-01-31 00:18:02 |
| Message-ID: | CAH2-WzmymSyOt5Y2RGbm6cJXg18J_ttfqjdcpodHe6Gp23ConQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Sun, Jan 18, 2026 at 6:51 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v8. A lot has changed in just the past week. We're feeling
> optimistic about getting I/O prefetching into Postgres 19, and so
> we've decided to fully commit ourselves to that.
Attached is v9.
v8 just stopped applying against HEAD. But there have been a few
notable developments since v8.
> Dropping leaf page buffer pins during index-only scans
> ------------------------------------------------------
>
> I realized (following a point made by Matthias van de Meent during a
> private call) that it simply isn't necessary for index pages to hold
> on to buffer pins, in general. We haven't actually done that with
> nbtree plain index scans with MVCC snapshots since 2015's commit
> 2ed5b87f, which added what we now call nbtree's dropPin optimization.
> What if we could find a way to teach *every* amgetbatch-supporting
> index AM to do the same trick for *all* scans (barring non-MVCC
> snapshot scans)? Including index-only scans and scans of unlogged
> relations? Then we'd have zero chance of unintended interactions with
> the read stream; there'd simply be no extra buffer pins that might
> confuse the read stream in the first place!
Changes in v9, compared to v8:
* It turns out that this "always set all items all-visible up front"
approach can be too aggressive with certain kinds of queries -- we
need to be less aggressive with dropping leaf page pins during index
only scans.
It's always correct, but it wastes cycles in certain kinds of queries.
Things like nestloop anti-joins with an inner index scan will
generally only need to access the first item's VM info; wasting cycles
on eagerly setting all the cached VM info up front regresses such
queries unacceptably (like maybe 30%; way too much).
Tomas wrote code that works around the problem, by only dropping a
leaf page buffer pin (during an index only scan) when all of the
batch's items visibility info has been looked by the VM. At the same
time we still make strong guarantees about holding on to no more than
one such pin at a time. Which is all we really need to keep the pin
resource management problems (read stream pin interactions) simple
during scans that use prefetching.
* Separately, there's a new experimental patch that passes down a hint
from ExecSetTupleBound to index scans.
This allows us to be cleverer about avoiding starting a read stream in
queries that use nestloop joins with inner index scans that can end
early/on the first tuple (and also with simple ORDER BY ... LIMIT N
index scans). In other words, it fixes regressions with queries rather
like the ones we were concerned about with the previous VM caching
change. Though this second enhancement isn't just helpful when it's an
index-only scan on the inner side.
* Much simpler heapam_batch_getnext_tid logic, particularly for
dealing with changes in scan direction for a scrollable cursor.
The new design makes a clean separation between the need to
invalidate/disable our read stream due to a tuple-level change in the
scan's direction, and the need to deal with a change in scan direction
that actually crosses a boundary between 2 adjoining batches/index
pages. In the former case we need to avoid allowing the scan to read
the now-wrong heap blocks from the read stream, whereas in the latter
case we need to be careful about keeping the batch ring buffer state
consistent.
Notice that these are 2 distinct (albeit conceptually close) problems.
Cursor scans that (for whatever reason) don't use a read stream never
have the first problem, but might still run into the second. OTOH a
scan can easily need to deal with invalidating this way, without ever
getting as far as crossing a batch boundary (the scan could just end,
the scan direction to change back, and then we don't have to change
anything about batchringbuf state).
Also improved the ring buffer data structure. It now uses uint8
offsets that wrap as needed. This is more idiomatic, and seems easier
to test.
selfuncs.c problem
------------------
Also worth noting that we recently found a problem with selfuncs.c:
the VISITED_PAGES_LIMIT stuff in selfuncs.c is broken right now. v9
tears that code out, pending adding back a real fix (earlier versions
of the patch had VISITED_PAGES_LIMIT, but it didn't work).
The underlying problem is that the existing VISITED_PAGES_LIMIT design
is incompatible with our new table_index_getnext_slot interface. The
new interface doesn't stop scanning until it is able to at least
return 1 tuple. But VISITED_PAGES_LIMIT was invented precisely because
get_actual_variable_endpoint's index scans took far too long, even
though they're only ever required to locate 1 tuple. So that just
won't work.
We'll need to invent some kind of API that directly acknowledges the
needs of the selfuncs.c caller, and others like it. Doing it in an
ad-hoc way just doesn't seem possible anymore. That will have to wait
for the next revision, though (or the one after that).
--
Peter Geoghegan
| Attachment | Content-Type | Size |
|---|---|---|
| v9-0008-Don-t-wait-for-already-in-progress-IO.patch | application/octet-stream | 20.6 KB |
| v9-0010-Make-hash-index-AM-use-amgetbatch-interface.patch | application/octet-stream | 39.6 KB |
| v9-0009-Add-fake-LSN-support-to-hash-index-AM.patch | application/octet-stream | 13.8 KB |
| v9-0001-Extract-fake-LSN-infrastructure-from-GiST-index-A.patch | application/octet-stream | 16.9 KB |
| v9-0007-Make-buffer-hit-helper.patch | application/octet-stream | 5.8 KB |
| v9-0006-Use-ExecSetTupleBound-hint-during-index-scans.patch | application/octet-stream | 9.1 KB |
| v9-0005-Add-prefetching-to-index-scans-using-batch-interf.patch | application/octet-stream | 60.9 KB |
| v9-0002-Use-fake-LSNs-to-improve-nbtree-dropPin-behavior.patch | application/octet-stream | 15.1 KB |
| v9-0004-Introduce-read_stream_-pause-resume-yield.patch | application/octet-stream | 11.8 KB |
| v9-0003-Add-batching-interfaces-used-by-heapam-and-nbtree.patch | application/octet-stream | 201.7 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Mark Wong | 2026-01-31 00:27:53 | Re: [PATCH] Add pg_get_database_ddl() function to reconstruct CREATE DATABASE statement |
| Previous Message | Mark Wong | 2026-01-30 22:05:33 | Re: Speed up COPY FROM text/CSV parsing using SIMD |