Re: index prefetching

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2023-12-20 01:41:11
Message-ID: 280dc83c-a16f-4424-1319-95e7e3f798bd@enterprisedb.com
Lists: pgsql-hackers

On 12/18/23 22:00, Robert Haas wrote:
> On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> But there's a layering problem that I don't know how to solve - I don't
>> see how we could make indexam.c entirely oblivious to the prefetching,
>> and move it entirely to the executor. Because how else would you know
>> what to prefetch?
>
> Yeah, that seems impossible.
>
> Some thoughts:
>
> * I think perhaps the subject line of this thread is misleading. It
> doesn't seem like there is any index prefetching going on here at all,
> and there couldn't be, unless you extended the index AM API with new
> methods. What you're actually doing is prefetching heap pages that
> will be needed by a scan of the index. I think this confusing naming
> has propagated itself into some parts of the patch, e.g.
> index_prefetch() reads *from the heap* which is not at all clear from
> the comment saying "Prefetch the TID, unless it's sequential or
> recently prefetched." You're not prefetching the TID: you're
> prefetching the heap tuple to which the TID points. That's not an
> academic distinction IMHO -- the TID would be stored in the index, so
> if we were prefetching the TID, we'd have to be reading index pages,
> not heap pages.

Yes, that's a fair complaint. I think the naming is mostly obsolete -
the prefetching initially happened much lower, in the index AMs. It was
prefetching the heap pages, of course, but it seemed reasonable to call
it "index prefetching". And even now it's called from indexam.c, where
most functions start with "index_".

But I'll think about a better / clearer name.

>
> * Regarding layering, my first thought was that the changes to
> index_getnext_tid() and index_getnext_slot() are sensible: read ahead
> by some number of TIDs, keep the TIDs you've fetched in an array
> someplace, use that to drive prefetching of blocks on disk, and return
> the previously-read TIDs from the queue without letting the caller
> know that the queue exists. I think that's the obvious design for a
> feature of this type, to the point where I don't really see that
> there's a viable alternative design.

I agree.

> Driving something down into the individual index AMs would make sense
> if you wanted to prefetch *from the indexes*, but it's unnecessary
> otherwise, and best avoided.
>

Right. In fact, the patch moved exactly in the opposite direction - it
was originally done at the AM level, and moved up. First to indexam.c,
then even more to the executor.

> * But that said, the skip_all_visible flag passed down to
> index_prefetch() looks like a VERY strong sign that the layering here
> is not what it should be. Right now, when some code calls
> index_getnext_tid(), that function does not need to know or care
> whether the caller is going to fetch the heap tuple or not. But with
> this patch, the code does need to care. So knowledge of the executor
> concept of an index-only scan trickles down into indexam.c, which now
> has to be able to make decisions that are consistent with the ones
> that the executor will make. That doesn't seem good at all.
>

I agree the all_visible flag is a sign the abstraction is not quite
right. I did that mostly to quickly verify whether the duplicate VM
checks are causing the perf regression (and they are).

Whatever the right abstraction is, it probably needs to do these VM
checks only once.
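
For example something like this (just a sketch of the idea, not code
from the patch - the struct/function names are made up): do the VM
check once when the TID is pulled from the index, remember the result
next to the queued TID, and reuse it both for the prefetch decision
and for the IOS visibility decision later.

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

/* hypothetical entry in the prefetch queue */
typedef struct QueuedTID
{
    ItemPointerData tid;
    bool        all_visible;    /* result of the single VM check */
} QueuedTID;

static void
queue_tid(Relation heapRel, ItemPointer tid, Buffer *vmbuf,
          QueuedTID *entry)
{
    BlockNumber block = ItemPointerGetBlockNumber(tid);

    entry->tid = *tid;

    /* one VM lookup per TID, reused by both prefetching and IOS */
    entry->all_visible = VM_ALL_VISIBLE(heapRel, block, vmbuf);

    /* all-visible pages don't need the heap at all, so don't prefetch */
    if (!entry->all_visible)
        (void) PrefetchBuffer(heapRel, MAIN_FORKNUM, block);
}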

> * I think it might make sense to have two different prefetching
> schemes. Ideally they could share some structure. If a caller is using
> index_getnext_slot(), then it's easy for prefetching to be fully
> transparent. The caller can just ask for TIDs and the prefetching
> distance and TID queue can be fully under the control of something
> that is hidden from the caller. But when using index_getnext_tid(),
> the caller needs to have an opportunity to evaluate each TID and
> decide whether we even want the heap tuple. If yes, then we feed that
> TID to the prefetcher; if no, we don't. That way, we're not
> replicating executor logic in lower-level code. However, that also
> means that the IOS logic needs to be aware that this TID queue exists
> and interact with whatever controls the prefetch distance. Perhaps
> after calling index_getnext_tid() you call
> index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and
> then you call index_prefetcher_get_tid() to drain the queue. Perhaps
> also the prefetcher has a "fill" callback that gets invoked when the
> TID queue isn't as full as the prefetcher wants it to be. Then
> index_getnext_slot() can just install a trivial fill callback that
> says index_prefetcher_put_tid(prefetcher, index_getnext_tid(...),
> true), but IOS can use a more sophisticated callback that checks the
> VM to determine what to pass for the third argument.
>

Yeah, after you pointed out the "leaky" abstraction, I also started to
think about customizing the behavior using a callback. I'm not sure
what exactly you mean by "fully transparent", but as I explained above
I think we need to allow passing some information between the
prefetcher and the executor - for example the results of the
visibility map checks in IOS.

I have imagined something like this:

nodeIndexscan / index_getnext_slot()
-> no callback, all TIDs are prefetched

nodeIndexonlyscan / index_getnext_tid()
-> callback checks VM for the TID, prefetches if not all-visible
-> the VM check result is stored in the queue (but in an extensible
way, so that other callbacks can store other stuff)
-> index_getnext_tid() also returns this extra information

So not that different from the WIP patch, but in a "generic" and
extensible way. Instead of hard-coding the all-visible flag, there'd
be some custom per-TID information. A bit like qsort_r() has a void*
arg to pass custom context.
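
To make that a bit more concrete (hypothetical names, not the actual
patch API), the interface could look roughly like this - the callback
decides whether to prefetch a TID and may attach custom per-TID data
that the executor gets back when draining the queue:

#include "access/genam.h"
#include "storage/itemptr.h"

/* sketch only - names are made up */
typedef struct IndexPrefetcher IndexPrefetcher;

/*
 * Called for each TID read from the index. Returns true if the heap
 * page should be prefetched, and may store custom per-TID data (e.g.
 * the VM check result for IOS) into *private_data.
 */
typedef bool (*index_prefetch_callback) (IndexScanDesc scan,
                                         ItemPointer tid,
                                         void *callback_arg,
                                         void **private_data);

/* plain index scans pass cb = NULL (prefetch everything), IOS passes
 * a callback that checks the visibility map */
extern IndexPrefetcher *index_prefetcher_create(IndexScanDesc scan,
                                                int distance,
                                                index_prefetch_callback cb,
                                                void *callback_arg);

/* queue one TID (and whatever the callback stored for it) */
extern void index_prefetcher_put_tid(IndexPrefetcher *prefetcher,
                                     ItemPointer tid);

/* drain the queue: next TID plus the per-TID data from the callback */
extern ItemPointer index_prefetcher_get_tid(IndexPrefetcher *prefetcher,
                                            void **private_data);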

Or if you envisioned something different, could you elaborate a bit?

> * I realize that I'm being a little inconsistent in what I just said,
> because in the first bullet point I said that this wasn't really index
> prefetching, and now I'm proposing function names that still start
> with index_prefetch. It's not entirely clear to me what the best thing
> to do about the terminology is here -- could it be a heap prefetcher,
> or a TID prefetcher, or an index scan prefetcher? I don't really know,
> but whatever we can do to make the naming more clear seems like a
> really good idea. Maybe there should be a clearer separation between
> the queue of TIDs that we're going to return from the index and the
> queue of blocks that we want to prefetch to get the corresponding heap
> tuples -- making that separation crisper might ease some of the naming
> issues.
>

I think if the code stays in indexam.c, it's sensible to keep the
index_ prefix, but the rest of the name should be more descriptive.
For example it might be index_prefetch_heap_pages() or something like
that.

> * Not that I want to be critical because I think this is a great start
> on an important project, but it does look like there's an awful lot of
> stuff here that still needs to be sorted out before it would be
> reasonable to think of committing this, both in terms of design
> decisions and just general polish. There's a lot of stuff marked with
> XXX and I think that's great because most of those seem to be good
> questions but that does leave the, err, small problem of figuring out
> the answers.

Absolutely. I certainly don't claim this is close to committable ...

> index_prefetch_is_sequential() makes me really nervous
> because it seems to depend an awful lot on whether the OS is doing
> prefetching, and how the OS is doing prefetching, and I think those
> might not be consistent across all systems and kernel versions.

If the OS does not have read-ahead, or it's not configured properly,
then the patch does not perform worse than what we have now. I'm far
more concerned about the opposite issue, i.e. causing regressions with
OS-level read-ahead. And the check handles that well, I think.
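
To explain what the check does: it keeps a short history of recently
requested block numbers and skips the explicit prefetch when the
requests form a contiguous run, which is exactly the pattern OS
read-ahead handles well. A simplified sketch (names and constants are
illustrative, not the patch code):

#include "postgres.h"
#include "storage/block.h"

#define SEQ_HISTORY_SIZE    8       /* illustrative */

typedef struct PrefetchSeqState
{
    BlockNumber history[SEQ_HISTORY_SIZE];  /* circular buffer */
    int         next;                       /* next insert position */
} PrefetchSeqState;

/*
 * Record the requested block and return true if the last few requests
 * form a contiguous ascending run - in that case we skip the explicit
 * prefetch and let OS read-ahead do its job.
 */
static bool
prefetch_is_sequential(PrefetchSeqState *state, BlockNumber block)
{
    state->history[state->next] = block;
    state->next = (state->next + 1) % SEQ_HISTORY_SIZE;

    for (int i = 1; i < SEQ_HISTORY_SIZE; i++)
    {
        int     idx = (state->next - 1 - i + SEQ_HISTORY_SIZE) % SEQ_HISTORY_SIZE;

        if (state->history[idx] != block - i)
            return false;       /* not sequential, worth prefetching */
    }

    return true;
}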

> Similarly with index_prefetch(). There's a lot of "magical"
> assumptions here. Even index_prefetch_add_cache() has this problem --
> the function assumes that it's OK if we sometimes fail to detect a
> duplicate prefetch request, which makes sense, but under what
> circumstances is it necessary to detect duplicates and in what cases
> is it optional? The function comments are silent about that, which
> makes it hard to assess whether the algorithm is good enough.
>

I don't quite understand what problem with duplicates you envision here.
Strictly speaking, we don't need to detect/prevent duplicates - it's
just that if you do posix_fadvise() for a block that's already in
memory, it's overhead / wasted time. The whole point is to not do that
very often. In this sense it's entirely optional, but desirable.
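
Roughly this (a minimal sketch with made-up names, the cache in the
patch is a bit more elaborate) - missing a duplicate only costs one
redundant posix_fadvise(), it never affects correctness:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

#define PREFETCH_CACHE_SIZE 1024    /* illustrative */

/*
 * Lossy, direct-mapped cache of recently prefetched block numbers. A
 * collision simply evicts the older entry, so an occasional duplicate
 * prefetch can slip through - which is harmless. Initialize all slots
 * to InvalidBlockNumber before use.
 */
typedef struct PrefetchCache
{
    BlockNumber slots[PREFETCH_CACHE_SIZE];
} PrefetchCache;

/* returns true if we already prefetched this block recently */
static bool
prefetch_cache_add(PrefetchCache *cache, BlockNumber block)
{
    int     slot = block % PREFETCH_CACHE_SIZE;

    if (cache->slots[slot] == block)
        return true;

    cache->slots[slot] = block;
    return false;
}

static void
maybe_prefetch(Relation rel, PrefetchCache *cache, BlockNumber block)
{
    /* skip the posix_fadvise() if the block was prefetched recently */
    if (!prefetch_cache_add(cache, block))
        (void) PrefetchBuffer(rel, MAIN_FORKNUM, block);
}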

I'm in no way claiming the comments are perfect, of course.

> * In terms of polish, one thing I noticed is that index_getnext_slot()
> calls index_prefetch_tids() even when scan->xs_heap_continue is set,
> which seems like it must be a waste, since we can't really need to
> kick off more prefetch requests halfway through a HOT chain referenced
> by a single index tuple, can we?

Yeah, I think that's true.
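
Something along these lines should do it, i.e. only consider
prefetching when advancing to the next index entry (a sketch based on
the current index_getnext_slot() loop - the index_prefetch_tids() call
is shown schematically, the actual argument list in the patch may
differ):

#include "postgres.h"
#include "access/genam.h"
#include "access/relscan.h"
#include "executor/tuptable.h"

bool
index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
                   TupleTableSlot *slot)
{
    for (;;)
    {
        if (!scan->xs_heap_continue)
        {
            ItemPointer tid;

            /*
             * Prefetch upcoming heap blocks only when advancing to a
             * new index entry, not while walking a HOT chain.
             */
            index_prefetch_tids(scan, direction);

            /* Time to fetch the next TID from the index */
            tid = index_getnext_tid(scan, direction);

            /* If we're out of index entries, we're done */
            if (tid == NULL)
                break;
        }

        /* Fetch the next (or only) visible heap tuple for this entry */
        if (index_fetch_heap(scan, slot))
            return true;
    }

    return false;
}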

> Also, blks_prefetch_rounds doesn't
> seem to be used anywhere, and neither that nor blks_prefetches are
> documented. In fact there's no new documentation at all, which seems
> probably not right. That's partly because there are no new GUCs, which
> I feel like typically for a feature like this would be the place where
> the feature behavior would be mentioned in the documentation.

That's mostly because the explain fields were added to help during
development. I'm not sure we actually want to make them part of EXPLAIN.

> I don't
> think it's a good idea to tie the behavior of this feature to
> effective_io_concurrency partly because it's usually a bad idea to
> make one setting control multiple different things, but perhaps even
> more because effective_io_concurrency doesn't actually work in a
> useful way AFAICT and people typically have to set it to some very
> artificially large value compared to how much real I/O parallelism
> they have. So probably there should be new GUCs with hopefully-better
> semantics, but at least the documentation for any existing ones would
> need updating, I would think.
>

I really don't want to have multiple knobs. At this point we have three
GUCs, each tuning prefetching for a fairly large part of the system:

effective_io_concurrency = regular queries
maintenance_io_concurrency = utility commands
recovery_prefetch = recovery / PITR

This seems sensible, but I really don't want many more GUCs tuning
prefetching for different executor nodes or something like that.

If we have issues with how effective_io_concurrency works (and I'm not
sure that's actually true), then perhaps we should fix that rather than
inventing new GUCs.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
