From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2023-06-09 00:06:00
Message-ID: 20230609000600.syqy447e6metnvyj@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
> At the pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits. The feedback was quite positive, so let me
> share the current patch more widely.

I'm really excited about this work.

> 1) pairing-heap in GiST / SP-GiST
>
> For most AMs, the index state is pretty trivial - matching items from a
> single leaf page. Prefetching that is straightforward, even if the
> current API is a bit cumbersome.
>
> Distance queries on GiST and SP-GiST are a problem, though, because
> those do not just read the pointers into a simple array, as the distance
> ordering requires passing stuff through a pairing-heap :-(
>
> I don't know how to best deal with that, especially not in the simple
> API. I don't think we can "scan forward" stuff from the pairing heap, so
> the only idea I have is actually having two pairing-heaps. Or maybe
> using the pairing heap for prefetching, but stashing the prefetched
> pointers into an array and then returning stuff from it.
>
> In the patch I simply prefetch items before we add them to the pairing
> heap, which is good enough for demonstrating the benefits.

I think it'd be perfectly fair to just not tackle distance queries for now.
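
FWIW, the "prefetch before pushing onto the pairing heap" approach you
describe would look roughly like the sketch below. This is illustrative
only - the helper function is invented, not the actual patch code - but
PrefetchBuffer() and the GiST scan structures are the real ones:

#include "postgres.h"
#include "access/gist_private.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical helper: issue a read hint for the heap block a matching
 * item points to, then queue the item on the pairing heap as before.
 */
static void
gist_queue_item_with_prefetch(IndexScanDesc scan, GISTSearchItem *item)
{
	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;

	/* only heap items carry a TID worth prefetching */
	if (GISTSearchItemIsHeap(*item))
		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
					   ItemPointerGetBlockNumber(&item->data.heap.heapPtr));

	pairingheap_add(so->queue, &item->phNode);
}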

> 2) prefetching from executor
>
> Another question is whether the prefetching shouldn't actually happen
> even higher - in the executor. That's what Andres suggested during the
> unconference, and it kinda makes sense. That's where we do prefetching
> for bitmap heap scans, so why should this happen lower, right?

Yea. Doing it at that layer also leaves room for further optimizations
in the future.

One thing I have been wondering about in this context is whether we
should not have split the code for IOS and plain index scans...

> 4) per-leaf prefetching
>
> The code as-is only prefetches items from one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page before reading / prefetching the next one.
>
> I think this is an acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
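
To make that concrete, consider the sketch below (all names hypothetical
except index_getnext_tid() and PrefetchBuffer(), which are the real
interfaces): if the executor owns a small window of TIDs it has pulled
from the AM but not yet fetched from the heap, the prefetch distance no
longer stops at a leaf page boundary, because the AM refills the window
across pages on its own.

#include "postgres.h"
#include "access/genam.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"

#define PREFETCH_TARGET 16		/* hypothetical prefetch distance */

/* hypothetical executor-side queue of TIDs awaiting their heap fetch */
typedef struct TidQueue
{
	ItemPointerData tids[PREFETCH_TARGET];
	int			ntids;
} TidQueue;

static void
refill_and_prefetch(IndexScanDesc scan, TidQueue *queue, ScanDirection dir)
{
	while (queue->ntids < PREFETCH_TARGET)
	{
		ItemPointer tid = index_getnext_tid(scan, dir);

		if (tid == NULL)
			break;				/* scan exhausted */

		/* issue the read hint as soon as we learn the TID */
		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
					   ItemPointerGetBlockNumber(tid));

		queue->tids[queue->ntids++] = *tid;
	}
}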

> 5) index-only scans
>
> I'm not sure what to do about index-only scans. On the one hand, the
> point of IOS is not to read stuff from the heap at all, so why prefetch
> it. OTOH if there are many allvisible=false pages, we still have to
> access them. And when that happens, we get the bizarre situation that
> IOS is slower than a regular index scan. But to address this, we'd
> have to consider the visibility during prefetching.

That should be easy to do, right?
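
Something like the sketch below, I'd think (the helper is made up, but
VM_ALL_VISIBLE() and PrefetchBuffer() are the real interfaces): check
the visibility map first, and only prefetch heap blocks the scan will
actually have to visit.

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

/*
 * Hypothetical helper: prefetch a heap block for an IOS only when the
 * VM says the page is not all-visible, i.e. when the scan will really
 * have to read it.
 */
static void
ios_maybe_prefetch(Relation heapRel, ItemPointer tid, Buffer *vmbuffer)
{
	BlockNumber blkno = ItemPointerGetBlockNumber(tid);

	/* all-visible pages are answered from the index; skip those */
	if (!VM_ALL_VISIBLE(heapRel, blkno, vmbuffer))
		PrefetchBuffer(heapRel, MAIN_FORKNUM, blkno);
}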

> Benchmark / TPC-H
> -----------------
>
> I ran the 22 queries on 100GB data set, with parallel query either
> disabled or enabled. And I measured timing (and speedup) for each query.
> The speedup results look like this (see the attached PDF for details):
>
> query   serial   parallel
>     1     101%        99%
>     2     119%       100%
>     3     100%        99%
>     4     101%       100%
>     5     101%       100%
>     6      12%        99%
>     7     100%       100%
>     8      52%        67%
>    10     102%       101%
>    11     100%        72%
>    12     101%       100%
>    13     100%       101%
>    14      13%       100%
>    15     101%       100%
>    16      99%        99%
>    17      95%       101%
>    18     101%       106%
>    19      30%        40%
>    20      99%       100%
>    21     101%       100%
>    22     101%       107%
>
> The percentage is (timing patched / timing master), so <100% means
> faster and >100% means slower.
>
> How much each query is affected depends on the query plan - many
> queries are close to 100%, which means "no difference". For the serial
> case, about 4 queries improved a lot (6, 8, 14, 19), while for the
> parallel case the benefits are somewhat less significant.
>
> My explanation is that either (a) parallel case used a different plan
> with fewer index scans or (b) the parallel query does more concurrent
> I/O simply by using parallel workers. Or maybe both.
>
> There are a couple of regressions too. I believe those are due to doing
> too much prefetching in some cases; the heuristics mentioned earlier
> should eliminate most of this.

I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive speedups in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?

I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).

Greetings,

Andres Freund
