From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-15 15:09:41
Message-ID: 6butbqln6ewi5kuxz3kfv2mwomnlgtate4mb4lpa7gb2l63j4t@stlwbi2dvvev
Lists: pgsql-hackers

Hi,

On 2025-08-14 19:36:49 -0400, Andres Freund wrote:
> On 2025-08-14 17:55:53 -0400, Peter Geoghegan wrote:
> > On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > > > We can optimize that by deferring the StartBufferIO() if we're encountering a
> > > > buffer that is undergoing IO, at the cost of some complexity. I'm not sure
> > > > real-world queries will encounter the pattern of the same block being
> > > > read in by a read stream multiple times in close proximity often enough
> > > > to make that worth it.
> > >
> > > We definitely need to be prepared for duplicate prefetch requests in
> > > the context of index scans.
> >
> > Can you (or anybody else) think of a quick and dirty way of working
> > around the problem on the read stream side? I would like to prioritize
> > getting the patch into a state where its overall performance profile
> > "feels right". From there we can iterate on fixing the underlying
> > issues in more principled ways.
>
> I think I can see a way to fix the issue, below read stream. Basically,
> whenever AsyncReadBuffers() finds a buffer that has ongoing IO, instead of
> waiting, as we do today, copy the wref to the ReadBuffersOperation and set a
> new flag indicating that we are waiting for an IO that was not started by the
> wref. Then, in WaitReadBuffers(), we wait for such foreign-started IOs. That
> has to be somewhat different code from today, because we have to deal with
> the possibility that the "foreign" IO failed.
>
> I'll try writing a prototype for that tomorrow. I think to actually get that
> into a committable shape we need a test harness (probably a read stream
> controlled by an SQL function that gets an array of buffers).

Attached is a prototype of this approach. It does seem to fix this issue.

New code disabled:

#### backwards sequential table ####
┌──────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────┤
│ Index Scan Backward using t_pk on t (actual rows=1048576.00 loops=1) │
│ Index Cond: ((a >= 16336) AND (a <= 49103)) │
│ Index Searches: 1 │
│ Buffers: shared hit=10291 read=49933 │
│ I/O Timings: shared read=213.277 │
│ Planning: │
│ Buffers: shared hit=91 read=19 │
│ I/O Timings: shared read=2.124 │
│ Planning Time: 3.269 ms │
│ Execution Time: 1023.279 ms │
└──────────────────────────────────────────────────────────────────────┘
(10 rows)

New code enabled:

#### backwards sequential table ####
┌──────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────┤
│ Index Scan Backward using t_pk on t (actual rows=1048576.00 loops=1) │
│ Index Cond: ((a >= 16336) AND (a <= 49103)) │
│ Index Searches: 1 │
│ Buffers: shared hit=10291 read=49933 │
│ I/O Timings: shared read=217.225 │
│ Planning: │
│ Buffers: shared hit=91 read=19 │
│ I/O Timings: shared read=2.009 │
│ Planning Time: 2.685 ms │
│ Execution Time: 602.987 ms │
└──────────────────────────────────────────────────────────────────────┘
(10 rows)

With the change enabled, the sequential query is faster than the random query:

#### backwards random table ####
┌────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────┤
│ Index Scan Backward using t_randomized_pk on t_randomized (actual rows=1048576.00 loops=1) │
│ Index Cond: ((a >= 16336) AND (a <= 49103)) │
│ Index Searches: 1 │
│ Buffers: shared hit=6085 read=77813 │
│ I/O Timings: shared read=347.285 │
│ Planning: │
│ Buffers: shared hit=127 read=5 │
│ I/O Timings: shared read=1.001 │
│ Planning Time: 1.751 ms │
│ Execution Time: 820.544 ms │
└────────────────────────────────────────────────────────────────────────────────────────────┘
(10 rows)

Greetings,

Andres Freund

Attachment Content-Type Size
v1-0001-bufmgr-aio-Prototype-for-not-waiting-for-already-.patch.txt text/plain 6.3 KB
