Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-08-12 16:53:13
Message-ID: 3dafd3be-8c20-4130-b956-eff178d9fe0a@vondra.me
Lists: pgsql-hackers

On 8/12/25 13:22, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Tue, 12 Aug 2025 at 08:07, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>>
>> On Tue, Aug 12, 2025 at 11:42 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>>> On Mon, Aug 11, 2025 at 5:07 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>>> I can do some tests with forward vs. backwards scans. Of course, the
>>>> trouble with finding these weird cases is that they may be fairly rare.
>>>> So hitting them is a matter of luck, or just happening to generate the
>>>> right data / query. But I'll give it a try and we'll see.
>>>
>>> I was talking more about finding "performance bugs" through a
>>> semi-directed process of trying random things while looking out for
>>> discrepancies. Something like that shouldn't require the usual
>>> "benchmarking rigor", since suspicious inconsistencies should be
>>> fairly obvious once encountered. I expect similar queries to have
>>> similar performance, regardless of superficial differences such as
>>> scan direction, DESC vs ASC column order, etc.
>>
>> I'd be interested to hear more about reverse scans. Bilal was
>> speculating about backwards I/O combining in read_stream.c a while
>> back, but we didn't have anything interesting to use it yet. You'll
>> probably see a flood of uncombined 8KB IOs in the pg_aios view while
>> travelling up the heap with cache misses today. I suspect Linux does
>> reverse sequential prefetching with buffered I/O (less sure about
>> other OSes) which should help but we'd still have more overheads than
>> we could if we combined them, not to mention direct I/O.
>
> If I remember correctly, I didn't continue working on this as I didn't
> see a performance improvement. Right now, my changes don't apply cleanly
> to the current HEAD but I can give it another try if you see value in
> this.
>
>> Not tested, but something like this might do it:
>>
>>     /* Can we merge it with the pending read? */
>> -   if (stream->pending_read_nblocks > 0 &&
>> -       stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
>> +   if (stream->pending_read_nblocks > 0)
>>     {
>> -       stream->pending_read_nblocks++;
>> -       continue;
>> +       if (stream->pending_read_blocknum + stream->pending_read_nblocks ==
>> +           blocknum)
>> +       {
>> +           stream->pending_read_nblocks++;
>> +           continue;
>> +       }
>> +       else if (stream->pending_read_blocknum == blocknum + 1 &&
>> +                stream->forwarded_buffers == 0)
>> +       {
>> +           stream->pending_read_blocknum--;
>> +           stream->pending_read_nblocks++;
>> +           continue;
>> +       }
>>     }
>
> Unfortunately this doesn't work. We need to handle backwards I/O
> combining in the StartReadBuffersImpl() function too, as the buffer indexes
> won't have correct blocknums. Also, I think buffer forwarding of split
> backwards I/O should be handled in a couple of places.
>

I'm running some tests looking for these weird cases, not just with the
patches but on master too, and I don't think b4212231 changed the
situation very much.

FWIW this issue is not caused by the index prefetching patches; I can
reproduce it on master (commit b227b0bb4e032e19b3679bedac820eba3ac0d1cf
from yesterday), so maybe we should split this off into a separate thread.

Consider, for example, the dataset built by the attached create.sql -
it's randomly generated, but the idea is that the data is correlated,
though not perfectly. The table is ~3.7GB, and these are cold runs
(caches dropped + restart).
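
To illustrate the shape of the data, it's generated roughly like this
(an illustrative sketch only - the exact table definition and constants
are in the attached script):

CREATE TABLE t (a bigint, filler text);

-- "a" mostly increases with physical position, but with enough random
-- jitter that the correlation is imperfect
INSERT INTO t
SELECT i + (random() * 100000)::bigint, repeat('x', 100)
  FROM generate_series(1, 30000000) s(i);

CREATE INDEX idx ON t (a);
VACUUM ANALYZE t;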

Anyway, a simple range query looks like this:

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a ASC;

QUERY PLAN
------------------------------------------------------------------------
Index Scan using idx on t
(actual time=0.584..433.208 rows=1048576.00 loops=1)
Index Cond: ((a >= 16336) AND (a <= 49103))
Index Searches: 1
Buffers: shared hit=7435 read=50872
I/O Timings: shared read=332.270
Planning:
Buffers: shared hit=78 read=23
I/O Timings: shared read=2.254
Planning Time: 3.364 ms
Execution Time: 463.516 ms
(10 rows)

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a DESC;

QUERY PLAN
------------------------------------------------------------------------
Index Scan Backward using idx on t
(actual time=0.566..22002.780 rows=1048576.00 loops=1)
Index Cond: ((a >= 16336) AND (a <= 49103))
Index Searches: 1
Buffers: shared hit=36131 read=50872
I/O Timings: shared read=21217.995
Planning:
Buffers: shared hit=82 read=23
I/O Timings: shared read=2.375
Planning Time: 3.478 ms
Execution Time: 22231.755 ms
(10 rows)

That's a pretty massive difference ... this is on my laptop, and the
timings vary quite a bit between runs, but the backward scan always
takes a multiple of the forward scan's time.

I did look into pg_aios, but there are only 8kB requests in both cases.
I haven't had time to look closer yet.
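
For reference, this is the kind of quick check I mean - sampling the
view from a second session while the scan runs (inherently racy, but
good enough to see the request-size distribution):

SELECT operation, length, count(*)
  FROM pg_aios
 GROUP BY operation, length
 ORDER BY count(*) DESC;

In both cases it only ever shows length=8192 entries, i.e. no combining.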

regards

--
Tomas Vondra

Attachment Content-Type Size
create.sql application/sql 602 bytes
