Re: index prefetching

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2024-01-12 16:42:39
Message-ID: 482ec3ff-52ad-415d-96fd-f3832a894023@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Here's an improved version of this patch, finishing a lot of the stuff
that I alluded to earlier - moving the code from indexam.c, renaming a
bunch of stuff, etc. I've also squashed it into a single patch, to make
it easier to review.

I'll briefly go through the main changes in the patch, and then will
respond in-line to Robert's points.

1) I moved the code from indexam.c to (new) execPrefetch.c. All the
prototypes / typedefs now live in executor.h, with only minimal changes
in execnodes.h (adding it to scan descriptors).

I believe this finally moves the code to the right place - it feels much
nicer and cleaner than in indexam.c. And it allowed me to hide a bunch
of internal structs and improve the general API, I think.

I'm sure there's stuff that could be named differently, but the layering
feels about right, I think.

2) A bunch of stuff got renamed to start with IndexPrefetch... to make
the naming consistent / clearer. I'm not entirely sure IndexPrefetch is
the right name, though - it's still a bit misleading, as it might seem
it's about prefetching index stuff, but really it's about heap pages
from indexes. Maybe IndexScanPrefetch() or something like that?

3) If there's a way to make this work with the streaming I/O API, I'm
not aware of it. But the overall design seems somewhat similar (based on
"next" callback etc.) so hopefully that'd make it easier to adopt it.

4) I initially relied on parallelModeOK to disable prefetching, which
kinda worked, but not really. Robert suggested to use the execute_once
flag directly, and I think that's much better - not only is it cleaner,
it also seems more appropriate (the parallel flag considers other stuff
that is not quite relevant to prefetching).

Thinking about this, I think it should be possible to make prefetching
work even for plans with execute_once=false. In particular, when the
plan changes direction it should be possible to simply "walk back" the
prefetch queue, to get to the "correct" place in in the scan. But I'm
not sure it's worth it, because plans that change direction often can't
really benefit from prefetches anyway - they'll often visit stuff they
accessed shortly before anyway. For plans that don't change direction
but may pause, we don't know if the plan pauses long enough for the
prefetched pages to get evicted or something. So I think it's OK that
execute_once=false means no prefetching.

5) I haven't done anything about the xs_heap_continue=true case yet.

6) I went through all the comments and reworked them considerably. The
main comment at execPrefetch.c start, with some overall design etc. And
then there are comments for each function, explaining that bit in more
detail. Or at least that's the goal - there's still work to do.

There's two trivial FIXMEs, but you can ignore those - it's not that
there's a bug, but that I'd like to rework something and just don't know
how yet.

There's also a couple of XXX comments. Some are a bit wild ideas for the
future, others are somewhat "open questions" to be discussed during a
review.

Anyway, there should be no outright obsolete comments - if there's
something I missed, let me know.

Now to Robert's message ...

On 1/9/24 21:31, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> Here's a somewhat reworked version of the patch. My initial goal was to
>> see if it could adopt the StreamingRead API proposed in [1], but that
>> turned out to be less straight-forward than I hoped, for two reasons:
>
> I guess we need Thomas or Andres or maybe Melanie to comment on this.
>

Yeah. Or maybe Thomas if he has thoughts on how to combine this with the
streaming I/O stuff.

>> Perhaps a bigger change is that I decided to move this into a separate
>> API on top of indexam.c. The original idea was to integrate this into
>> index_getnext_tid/index_getnext_slot, so that all callers benefit from
>> the prefetching automatically. Which would be nice, but it also meant
>> it's need to happen in the indexam.c code, which seemed dirty.
>
> This patch is hard to review right now because there's a bunch of
> comment updating that doesn't seem to have been done for the new
> design. For instance:
>
> + * XXX This does not support prefetching of heap pages. When such
> prefetching is
> + * desirable, use index_getnext_tid().
>
> But not any more.
>

True. And this is now even more obsolete, as the prefetching was moved
from indexam.c layer to the executor.

> + * XXX The prefetching may interfere with the patch allowing us to evaluate
> + * conditions on the index tuple, in which case we may not need the heap
> + * tuple. Maybe if there's such filter, we should prefetch only pages that
> + * are not all-visible (and the same idea would also work for IOS), but
> + * it also makes the indexing a bit "aware" of the visibility stuff (which
> + * seems a somewhat wrong). Also, maybe we should consider the filter
> selectivity
>
> I'm not sure whether all the problems in this area are solved, but I
> think you've solved enough of them that this at least needs rewording,
> if not removing.
>
> + * XXX Comment/check seems obsolete.
>
> This occurs in two places. I'm not sure if it's accurate or not.
>
> + * XXX Could this be an issue for the prefetching? What if we
> prefetch something
> + * but the direction changes before we get to the read? If that
> could happen,
> + * maybe we should discard the prefetched data and go back? But can we even
> + * do that, if we already fetched some TIDs from the index? I don't think
> + * indexorderdir can't change, but es_direction maybe can?
>
> But your email claims that "The patch simply disables prefetching for
> such queries, using the same logic that we do for parallelism." FWIW,
> I think that's a fine way to handle that case.
>

True. I left behind this comment partly intentionally, to point out why
we disable the prefetching in these cases, but you're right the comment
now explains something that can't happen.

> + * XXX Maybe we should enable prefetching, but prefetch only pages that
> + * are not all-visible (but checking that from the index code seems like
> + * a violation of layering etc).
>
> Isn't this fixed now? Note this comment occurs twice.
>
> + * XXX We need to disable this in some cases (e.g. when using index-only
> + * scans, we don't want to prefetch pages). Or maybe we should prefetch
> + * only pages that are not all-visible, that'd be even better.
>
> Here again.
>

Sorry, you're right those comments (and a couple more nearby) were
stale. Removed / clarified.

> And now for some comments on other parts of the patch, mostly other
> XXX comments:
>
> + * XXX This does not support prefetching of heap pages. When such
> prefetching is
> + * desirable, use index_getnext_tid().
>
> There's probably no reason to write XXX here. The comment is fine.
>
> + * XXX Notice we haven't added the block to the block queue yet, and there
> + * is a preceding block (i.e. blockIndex-1 is valid).
>
> Same here, possibly? If this XXX indicates a defect in the code, I
> don't know what the defect is, so I guess it needs to be more clear.
> If it is just explaining the code, then there's no reason for the
> comment to say XXX.
>

Yeah, removed the XXX / reworded a bit.

> + * XXX Could it be harmful that we read the queue backwards? Maybe memory
> + * prefetching works better for the forward direction?
>
> It does. But I don't know whether that matters here or not.
>
> + * XXX We do add the cache size to the request in order not to
> + * have issues with uint64 underflows.
>
> I don't know what this means.
>

There's a check that does this:

(x + PREFETCH_CACHE_SIZE) >= y

it might also be done as "mathematically equivalent"

x >= (y - PREFETCH_CACHE_SIZE)

but if the "y" is an uint64, and the value is smaller than the constant,
this would underflow. It'd eventually disappear, once the "y" gets large
enough, ofc.

> + * XXX not sure this correctly handles xs_heap_continue - see
> index_getnext_slot,
> + * maybe nodeIndexscan needs to do something more to handle this?
> Although, that
> + * should be in the indexscan next_cb callback, probably.
> + *
> + * XXX If xs_heap_continue=true, we need to return the last TID.
>
> You've got a bunch of comments about xs_heap_continue here -- and I
> don't fully understand what the issues are here with respect to this
> particular patch, but I think that the general purpose of
> xs_heap_continue is to handle the case where we need to return more
> than one tuple from the same HOT chain. With an MVCC snapshot that
> doesn't happen, but with say SnapshotAny or SnapshotDirty, it could.
> As far as possible, the prefetcher shouldn't be involved at all when
> xs_heap_continue is set, I believe, because in that case we're just
> returning a bunch of tuples from the same page, and the extra fetches
> from that heap page shouldn't trigger or require any further
> prefetching.
>

Yes, that's correct. The current code simply ignores that flag and just
proceeds to the next TID. Which is correct for xs_heap_continue=false,
and thus all MVCC snapshots work fine. But for the Any/Dirty case it
needs to work a bit differently.

> + * XXX Should this also look at plan.plan_rows and maybe cap the target
> + * to that? Pointless to prefetch more than we expect to use. Or maybe
> + * just reset to that value during prefetching, after reading the next
> + * index page (or rather after rescan)?
>
> It seems questionable to use plan_rows here because (1) I don't think
> we have existing cases where we use the estimated row count in the
> executor for anything, we just carry it through so EXPLAIN can print
> it and (2) row count estimates can be really far off, especially if
> we're on the inner side of a nested loop, we might like to figure that
> out eventually instead of just DTWT forever. But on the other hand
> this does feel like an important case where we have a clue that
> prefetching might need to be done less aggressively or not at all, and
> it doesn't seem right to ignore that signal either. I wonder if we
> want this shaped in some other way, like a Boolean that says
> are-we-under-a-potentially-row-limiting-construct e.g. limit or inner
> side of a semi-join or anti-join.
>

The current code actually does look at plan_rows when calculating the
prefetch target:

prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation,
node->ss.ps.plan->plan_rows,
estate->es_use_prefetching);

but I agree maybe it should not, for the reasons you explain. I'm not
attached to this part.

> + * We reach here if the index only scan is not parallel, or if we're
> + * serially executing an index only scan that was planned to be
> + * parallel.
>
> Well, this seems sad.
>

Stale comment, I believe. However, I didn't see much benefits with
parallel index scan during testing. Having I/O from multiple workers
generally had the same effect, I think.

> + * XXX This might lead to IOS being slower than plain index scan, if the
> + * table has a lot of pages that need recheck.
>
> How?
>

The comment is not particularly clear what "this" means, but I believe
this was about index-only scan with many not-all-visible pages. If it
didn't do prefetching, a regular index scan with prefetching may be way
faster. But the code actually allows doing prefetching even for IOS, by
checking the vm in the "next" callback.

> + /*
> + * XXX Only allow index prefetching when parallelModeOK=true. This is a bit
> + * of a misuse of the flag, but we need to disable prefetching for cursors
> + * (which might change direction), and parallelModeOK does that. But maybe
> + * we might (or should) have a separate flag.
> + */
>
> I think the correct flag to be using here is execute_once, which
> captures whether the executor could potentially be invoked a second
> time for the same portal. Changes in the fetch direction are possible
> if and only if !execute_once.
>

Right. The new patch version does that.

>> Note 1: The IndexPrefetch name is a bit misleading, because it's used
>> even with prefetching disabled - all index reads from the index scan
>> happen through it. Maybe it should be called IndexReader or something
>> like that.
>
> My biggest gripe here is the capitalization. This version adds, inter
> alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and
> index_heap_prefetch_target, which seems like one or two too many
> conventions. But maybe the PREFETCH_* macros don't even belong in a
> public header.
>
> I do like the index_heap_prefetch_* naming. Possibly that's too
> verbose to use for everything, but calling this index-heap-prefetch
> rather than index-prefetch seems clearer.
>

Yeah. I renamed all the structs and functions to IndexPrefetchSomething,
to keep it consistent. And then the constants are all capital, ofc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v20240112-0001-Prefetch-heap-pages-during-index-scans.patch text/x-patch 54.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-01-12 16:45:06 Re: psql not responding to SIGINT upon db reconnection
Previous Message Robert Haas 2024-01-12 16:28:09 Re: Error "initial slot snapshot too large" in create replication slot