From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2023-06-09 00:06:00
Message-ID: 20230609000600.syqy447e6metnvyj@awork3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.
I'm really excited about this work.
> 1) pairing-heap in GiST / SP-GiST
>
> For most AMs, the index state is pretty trivial - matching items from a
> single leaf page. Prefetching that is pretty trivial, even if the
> current API is a bit cumbersome.
>
> Distance queries on GiST and SP-GiST are a problem, though, because
> those do not just read the pointers into a simple array, as the distance
> ordering requires passing stuff through a pairing-heap :-(
>
> I don't know how to best deal with that, especially not in the simple
> API. I don't think we can "scan forward" stuff from the pairing heap, so
> the only idea I have is actually having two pairing-heaps. Or maybe
> using the pairing heap for prefetching, but stashing the prefetched
> pointers into an array and then returning stuff from it.
>
> In the patch I simply prefetch items before we add them to the pairing
> heap, which is good enough for demonstrating the benefits.
I think it'd be perfectly fair to just not tackle distance queries for now.
> 2) prefetching from executor
>
> Another question is whether the prefetching shouldn't actually happen
> even higher - in the executor. That's what Andres suggested during the
> unconference, and it kinda makes sense. That's where we do prefetching
> for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering about in this context is whether we should not
have split the code for IOS and plain indexscans...
> 4) per-leaf prefetching
>
> The code is restricted to only prefetch items from one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page before reading / prefetching the next one.
>
> I think this is an acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using the pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
> 5) index-only scans
>
> I'm not sure what to do about index-only scans. On the one hand, the
> point of IOS is not to read stuff from the heap at all, so why prefetch
> it. OTOH if there are many allvisible=false pages, we still have to
> access the heap. And if that happens, this leads to the bizarre situation
> that IOS is slower than regular index scan. But to address this, we'd
> have to consider the visibility during prefetching.
That should be easy to do, right?
> Benchmark / TPC-H
> -----------------
>
> I ran the 22 queries on 100GB data set, with parallel query either
> disabled or enabled. And I measured timing (and speedup) for each query.
> The speedup results look like this (see the attached PDF for details):
>
> query serial parallel
> 1 101% 99%
> 2 119% 100%
> 3 100% 99%
> 4 101% 100%
> 5 101% 100%
> 6 12% 99%
> 7 100% 100%
> 8 52% 67%
> 10 102% 101%
> 11 100% 72%
> 12 101% 100%
> 13 100% 101%
> 14 13% 100%
> 15 101% 100%
> 16 99% 99%
> 17 95% 101%
> 18 101% 106%
> 19 30% 40%
> 20 99% 100%
> 21 101% 100%
> 22 101% 107%
>
> The percentage is (timing patched / timing master), so <100% means faster
> and >100% means slower.
>
> The different queries are affected depending on the query plan - many
> queries are close to 100%, which means "no difference". For the serial
> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
> while for the parallel case the benefits are somewhat less significant.
>
> My explanation is that either (a) parallel case used a different plan
> with fewer index scans or (b) the parallel query does more concurrent
> I/O simply by using parallel workers. Or maybe both.
>
> There are a couple of regressions too. I believe those are due to doing
> too much prefetching in some cases, and the heuristics mentioned earlier
> should eliminate most of them, I think.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
Greetings,
Andres Freund