Re: index prefetching

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2023-06-09 10:18:11
Message-ID: a062ac09-c574-1d6a-22b6-f3cc0d28ed42@enterprisedb.com
Lists: pgsql-hackers

On 6/9/23 02:06, Andres Freund wrote:
> Hi,
>
> On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
>> At pgcon unconference I presented a PoC patch adding prefetching for
>> indexes, along with some benchmark results demonstrating the (pretty
>> significant) benefits etc. The feedback was quite positive, so let me
>> share the current patch more widely.
>
> I'm really excited about this work.
>
>
>> 1) pairing-heap in GiST / SP-GiST
>>
>> For most AMs, the index state is pretty trivial - matching items from a
>> single leaf page. Prefetching that is pretty trivial, even if the
>> current API is a bit cumbersome.
>>
>> Distance queries on GiST and SP-GiST are a problem, though, because
>> those do not just read the pointers into a simple array, as the distance
>> ordering requires passing stuff through a pairing-heap :-(
>>
>> I don't know how to best deal with that, especially not in the simple
>> API. I don't think we can "scan forward" stuff from the pairing heap, so
>> the only idea I have is actually having two pairing-heaps. Or maybe
>> using the pairing heap for prefetching, but stashing the prefetched
>> pointers into an array and then returning stuff from it.
>>
>> In the patch I simply prefetch items before we add them to the pairing
>> heap, which is good enough for demonstrating the benefits.
>
> I think it'd be perfectly fair to just not tackle distance queries for now.
>

My concern is that if we cut this from v0 entirely, we'll end up with
an API that won't be suitable for adding distance queries later.
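
To make the "stash" idea above a bit more concrete, here's a rough C
sketch (just an illustration, not the patch's actual code - names like
PrefetchStash and stash_refill are made up): pop items from the pairing
heap in distance order, prefetch their heap blocks right away, and
stash the TIDs so they can still be returned in distance order:

#include "access/gist_private.h"
#include "lib/pairingheap.h"
#include "storage/bufmgr.h"

#define STASH_SIZE 16           /* hypothetical prefetch window */

typedef struct PrefetchStash
{
    ItemPointerData tids[STASH_SIZE];
    int         ntids;
} PrefetchStash;

/*
 * Pop up to STASH_SIZE heap items from the pairing heap in distance
 * order, prefetching each heap block immediately. The stashed TIDs are
 * then handed back to the scan in the same order, preserving the
 * distance ordering.
 */
static void
stash_refill(PrefetchStash *stash, pairingheap *queue, Relation heapRel)
{
    stash->ntids = 0;

    while (stash->ntids < STASH_SIZE && !pairingheap_is_empty(queue))
    {
        GISTSearchItem *item = (GISTSearchItem *)
            pairingheap_remove_first(queue);

        /* a real scan would descend into index-page items here */
        if (!GISTSearchItemIsHeap(*item))
            continue;

        PrefetchBuffer(heapRel, MAIN_FORKNUM,
                       ItemPointerGetBlockNumber(&item->data.heap.heapPtr));
        stash->tids[stash->ntids++] = item->data.heap.heapPtr;
    }
}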

>
>> 2) prefetching from executor
>>
>> Another question is whether the prefetching shouldn't actually happen
>> even higher - in the executor. That's what Andres suggested during the
>> unconference, and it kinda makes sense. That's where we do prefetching
>> for bitmap heap scans, so why should this happen lower, right?
>
> Yea. I think it also provides potential for further optimizations in the
> future to do it at that layer.
>
> One thing I have been wondering around this is whether we should not have
> split the code for IOS and plain indexscans...
>

Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c. Or
did you mean something else?

>
>> 4) per-leaf prefetching
>>
>> The code only prefetches items from one leaf page. If the
>> index scan needs to scan multiple (many) leaf pages, we have to process
>> the first leaf page first before reading / prefetching the next one.
>>
>> I think this is an acceptable limitation, certainly for v0. Prefetching
>> across multiple leaf pages seems way more complex (particularly for the
>> cases using pairing heap), so let's leave this for the future.
>
> Hm. I think that really depends on the shape of the API we end up with. If we
> move the responsibility more towards the executor, I think it very well
> could end up being just as simple to prefetch across index pages.
>

Maybe. I'm open to that idea if you have an idea of how to shape the
API to make this possible (although perhaps not in v0).
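
For the sake of discussion, here's one possible shape (a sketch with
made-up names, nothing more): if the executor pulls TIDs from the AM
into a small look-ahead queue and prefetches them as they arrive, the
window spans leaf-page boundaries for free, because index_getnext_tid()
already handles stepping to the next leaf page internally:

#include "access/genam.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"

#define LOOKAHEAD 32            /* hypothetical prefetch distance */

typedef struct TidQueue
{
    ItemPointerData tids[LOOKAHEAD];
    int         head;
    int         ntids;
} TidQueue;

/*
 * Return the next TID via *out, keeping up to LOOKAHEAD TIDs prefetched
 * ahead of the consumer. Returns false once the scan is exhausted.
 */
static bool
tid_queue_next(IndexScanDesc scan, TidQueue *q, ItemPointer out)
{
    /* top up the look-ahead window */
    while (q->ntids < LOOKAHEAD)
    {
        ItemPointer tid = index_getnext_tid(scan, ForwardScanDirection);

        if (tid == NULL)
            break;              /* no more matches */

        /* issue the heap prefetch as soon as we see the TID */
        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
                       ItemPointerGetBlockNumber(tid));

        q->tids[(q->head + q->ntids) % LOOKAHEAD] = *tid;
        q->ntids++;
    }

    if (q->ntids == 0)
        return false;

    *out = q->tids[q->head];
    q->head = (q->head + 1) % LOOKAHEAD;
    q->ntids--;
    return true;
}

Obviously this ignores things like mark/restore, kill_prior_tuple and
rescans, but it suggests doing this above the AM might not be much
harder than the per-leaf version.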

>
>> 5) index-only scans
>>
>> I'm not sure what to do about index-only scans. On the one hand, the
>> point of IOS is not to read stuff from the heap at all, so why prefetch
>> it. OTOH if there are many allvisible=false pages, we still have to
>> access the heap. And if that happens, this leads to the bizarre situation
>> that IOS is slower than a regular index scan. But to address this, we'd
>> have to consider the visibility during prefetching.
>
> That should be easy to do, right?
>

It doesn't seem particularly complicated (famous last words), and we
need to do the VM checks anyway, so it shouldn't add a lot of overhead
either.
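
Roughly like this (a sketch only; the actual patch would have to manage
the VM buffer the same way index-only scans already do):

#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

/*
 * Prefetch the heap page for a TID, but only if the visibility map
 * says it is not all-visible. For all-visible pages an index-only scan
 * never touches the heap, so prefetching them would be wasted work.
 */
static void
ios_prefetch_tid(Relation heapRel, ItemPointer tid, Buffer *vmbuf)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(tid);

    if (VM_ALL_VISIBLE(heapRel, blkno, vmbuf))
        return;

    PrefetchBuffer(heapRel, MAIN_FORKNUM, blkno);
}

And the VM check result could perhaps be remembered, so the scan
doesn't have to repeat it when it actually returns the tuple.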

>
>
>> Benchmark / TPC-H
>> -----------------
>>
>> I ran the 22 queries on 100GB data set, with parallel query either
>> disabled or enabled. And I measured timing (and speedup) for each query.
>> The speedup results look like this (see the attached PDF for details):
>>
>>     query   serial   parallel
>>         1     101%        99%
>>         2     119%       100%
>>         3     100%        99%
>>         4     101%       100%
>>         5     101%       100%
>>         6      12%        99%
>>         7     100%       100%
>>         8      52%        67%
>>        10     102%       101%
>>        11     100%        72%
>>        12     101%       100%
>>        13     100%       101%
>>        14      13%       100%
>>        15     101%       100%
>>        16      99%        99%
>>        17      95%       101%
>>        18     101%       106%
>>        19      30%        40%
>>        20      99%       100%
>>        21     101%       100%
>>        22     101%       107%
>>
>> The percentage is (timing patched / timing master), so <100% means
>> faster and >100% means slower.
>>
>> How much each query is affected depends on the query plan - many
>> queries are close to 100%, which means "no difference". For the serial
>> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
>> while for the parallel case the benefits are somewhat less significant.
>>
>> My explanation is that either (a) parallel case used a different plan
>> with fewer index scans or (b) the parallel query does more concurrent
>> I/O simply by using parallel workers. Or maybe both.
>>
>> There are a couple regressions too, I believe those are due to doing too
>> much prefetching in some cases, and some of the heuristics mentioned
>> earlier should eliminate most of this, I think.
>
> I'm a bit confused by some of these numbers. How can OS-level prefetching lead
> to such massive gains in the already cached case, e.g. in tpch q06 and q08?
> Unless I missed what "xeon / cached (speedup)" indicates?
>

I forgot to explain what "cached" means in the TPC-H case. It means the
second execution of the query, so you can imagine it like this:

for q in `seq 1 22`; do
    # 1. drop caches and restart postgres
    # 2. run query $q -> uncached
    # 3. run query $q -> cached
done

So the second execution has a chance of finding data already in memory -
but maybe not all of it, because this is a 100GB data set (so ~200GB
after loading) and the machine only has 64GB of RAM.

I think a likely explanation is that some of the data wasn't actually
in memory, so the prefetching still helped.

> I think it'd be good to run a performance comparison of the unpatched vs
> patched cases, with prefetching disabled for both. It's possible that
> something in the patch caused unintended changes (say spilling during a
> hashagg, due to larger struct sizes).
>

That's certainly a good idea. I'll do that in the next round of tests.
I also plan to run a test on a data set that fits into RAM, to cover
the "properly cached" case.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
