From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-02-13 19:00:59
Message-ID: 4f5f16ef-df1e-4e09-9b3f-2e0961ab5117@enterprisedb.com
Lists: pgsql-hackers

On 2/7/24 22:48, Melanie Plageman wrote:
> ...
>
> Attached is a patch which implements a real queue and fixes some of
> the issues with the previous version. It doesn't pass tests yet and
> has issues. Some are bugs in my implementation I need to fix. Some are
> issues we would need to solve in the streaming read API. Some are
> issues with index prefetching generally.
>
> Note that these two patches have to be applied before 21d9c3ee4e
> because Thomas hasn't released a rebased version of the streaming read
> API patches yet.
>

Thanks for working on this, and for investigating the various issues.

> Issues
> ---
> - kill prior tuple
>
> This optimization doesn't work with index prefetching with the current
> design. Kill prior tuple relies on alternating between fetching a
> single index tuple and visiting the heap. After visiting the heap we
> can potentially kill the immediately preceding index tuple. Once we
> fetch multiple index tuples, enqueue their TIDs, and later visit the
> heap, the next index page we visit may not contain all of the index
> tuples deemed killable by our visit to the heap.
>

I admit I hadn't thought about kill_prior_tuple until you pointed it
out. Yeah, prefetching separates (de-synchronizes) the two scans (index
and heap) in a way that prevents this optimization. Or at least makes
it much more complex :-(

> In our case, we could try and fix this by prefetching only heap blocks
> referred to by index tuples on the same index page. Or we could try
> and keep a pool of index pages pinned and go back and kill index
> tuples on those pages.
>

I think restricting the prefetching to a single index page would not
be a huge issue performance-wise - that's pretty much what the initial
patch version (implemented at the index AM level) did. The prefetch
queue would get drained as we approach the end of the index page, but
luckily index pages tend to have a lot of entries. It would put an
upper bound on the prefetch distance, though (much lower than the
effective_io_concurrency maximum of 1000, but I'd say common values
are 10-100 anyway).
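
To illustrate, the AM-level version did roughly this whenever the scan
stepped onto a new leaf page (a simplified sketch, not the actual patch
code, though the nbtree fields are the real ones):

static void
prefetch_current_leaf(BTScanOpaque so, Relation heapRel, int distance)
{
    /* prefetch at most 'distance' heap blocks, never past the leaf */
    int     end = Min(so->currPos.lastItem,
                      so->currPos.itemIndex + distance);

    for (int i = so->currPos.itemIndex; i <= end; i++)
    {
        ItemPointer tid = &so->currPos.items[i].heapTid;

        PrefetchBuffer(heapRel, MAIN_FORKNUM,
                       ItemPointerGetBlockNumber(tid));
    }
}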

But how would we know we're on the same index page? That knowledge is
not available outside the index AM - the executor or indexam.c does not
know this, right? Presumably we could expose this, somehow, but it seems
like a violation of the abstraction ...

The same thing affects keeping multiple index pages pinned, for TIDs
that are yet to be used by the index scan. We'd need to know when to
release a pinned page, once we're done with processing all items.

FWIW I haven't tried implementing any of this, so maybe I'm missing
something and it can be made to work in a nice way.
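
To illustrate what I mean by the "pool of pinned pages", maybe
something like this (entirely made-up structures, just to show the
idea):

typedef struct PrefetchedLeaf
{
    Buffer          buf;        /* pinned index leaf page */
    int             nkilled;    /* killable items found so far */
    OffsetNumber    killed[MaxIndexTuplesPerPage];
} PrefetchedLeaf;

typedef struct LeafPinPool
{
    int             nleaves;
    PrefetchedLeaf  leaves[8];  /* arbitrary cap on pinned pages */
} LeafPinPool;

/*
 * Record a killable item on one of the pinned leaf pages. The page
 * stays pinned until no queued TID references it anymore.
 */
static void
leaf_pool_mark_killed(LeafPinPool *pool, Buffer buf, OffsetNumber off)
{
    for (int i = 0; i < pool->nleaves; i++)
    {
        if (pool->leaves[i].buf == buf)
        {
            pool->leaves[i].killed[pool->leaves[i].nkilled++] = off;
            return;
        }
    }
}

/* Once a leaf is no longer needed, apply the kills and unpin it. */
static void
leaf_pool_retire(LeafPinPool *pool, PrefetchedLeaf *leaf)
{
    if (leaf->nkilled > 0)
    {
        /* something like _bt_killitems() would go here */
    }
    ReleaseBuffer(leaf->buf);
}

The hard part would still be knowing when a leaf can be retired, which
runs into the same layering question as above.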

> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps
> there is an easier way to fix this, as I don't think the mvcc test
> failed on Tomas' version.
>

I kinda doubt it worked correctly, considering I simply ignored the
optimization. It's far more likely it just worked by luck.

> - switching scan directions
>
> If the index scan switches directions on a given invocation of
> IndexNext(), heap blocks may have already been prefetched and read for
> blocks containing tuples beyond the point at which we want to switch
> directions.
>
> We could fix this by having some kind of streaming read "reset"
> callback to drop all of the buffers which have been prefetched which
> are now no longer needed. We'd have to go backwards from the last TID
> which was yielded to the caller and figure out which buffers in the
> pgsr buffer ranges are associated with all of the TIDs which were
> prefetched after that TID. The TIDs are in the per_buffer_data
> associated with each buffer in pgsr. The issue would be searching
> through those efficiently.
>

Yeah, that's roughly what I envisioned in one of my previous messages
about this issue - walking back the TIDs read from the index and added
to the prefetch queue.

> The other issue is that the streaming read API does not currently
> support backwards scans. So, if we switch to a backwards scan from a
> forwards scan, we would need to fall back to the non-streaming-read
> method. We could do this by just setting the TID queue size to 1
> (which is what I have currently implemented). Or we could add
> backwards scan support to the streaming read API.
>

What do you mean by "support for backwards scans" in the streaming read
API? I imagined it naively as

1) drop all requests in the streaming read API queue

2) walk back all "future" requests in the TID queue

3) start prefetching as if from scratch

Maybe there's a way to optimize this and reuse some of the work more
efficiently, but my assumption is that the scan direction does not
change very often, and that we process many items in between.
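
In pseudo-code, something like this (streaming_read_reset() is a
hypothetical entry point, and the TID queue fields are made up too):

static void
index_prefetch_change_direction(IndexPrefetchState *pf,
                                ScanDirection newdir)
{
    /* 1) drop all requests queued in the streaming read API */
    streaming_read_reset(pf->pgsr);     /* does not exist today */

    /* 2) walk back the "future" TIDs we enqueued but never returned */
    while (pf->queue_next < pf->queue_end)
    {
        pf->queue_end--;
        /* release per-TID resources (copied tuples etc.) here */
    }

    /* 3) resume prefetching from scratch, in the new direction */
    pf->direction = newdir;
}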

> - mark and restore
>
> Similar to the issue with switching the scan direction, mark and
> restore requires us to reset the TID queue and streaming read queue.
> For now, I've hacked in something to the PlannerInfo and Plan to set
> the TID queue size to 1 for plans containing a merge join (yikes).
>

Haven't thought about this very much, will take a closer look.

> - multiple executions
>
> For reasons I don't entirely understand yet, multiple executions (not
> rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas'
> patch, I have disabled prefetching (and made the TID queue size 1)
> when execute_once is false.
>

Don't work in what sense? What is (not) happening?

> - Index Only Scans need to return IndexTuples
>
> Because index only scans return either the IndexTuple pointed to by
> IndexScanDesc->xs_itup or the HeapTuple pointed to by
> IndexScanDesc->xs_hitup -- both of which are populated by the index
> AM, we have to save copies of those IndexTupleData and HeapTupleDatas
> for every TID whose block we prefetch.
>
> This might be okay, but it is a bit sad to have to make copies of those tuples.
>
> In this patch, I still haven't figured out the memory management part.
> I copy over the tuples when enqueuing a TID queue item and then copy
> them back again when the streaming read API returns the
> per_buffer_data to us. Something is still not quite right here. I
> suspect this is part of the reason why some of the other tests are
> failing.
>

It's not clear to me why you need to copy the tuples back - shouldn't
it be enough to copy each tuple just once?

FWIW if we decide to pin multiple index pages (to make kill_prior_tuple
work), that would also mean we don't need to copy any tuples - we could
simply point into the pinned buffers for all of them, right?
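
That is, I'd expect the queue entry to look roughly like this (names
invented), with exactly one copy made when the TID is enqueued:

typedef struct TidQueueEntry
{
    ItemPointerData tid;

    /*
     * For index-only scans: copied once via CopyIndexTuple() when
     * enqueued, or left NULL if we keep the leaf page pinned and can
     * point into the buffer instead.
     */
    IndexTuple      itup;
} TidQueueEntry;

static void
tid_queue_push(TidQueue *queue, ItemPointer tid, IndexTuple itup)
{
    TidQueueEntry *entry = &queue->items[queue->nitems++];

    entry->tid = *tid;
    entry->itup = itup ? CopyIndexTuple(itup) : NULL;
}

Everything downstream would then just point at entry->itup, with no
second copy.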

> Other issues/gaps in my implementation:
>
> Determining where to allocate the memory for the streaming read object
> and the TID queue is an outstanding TODO. To implement a fallback
> method for cases in which streaming read doesn't work, I set the queue
> size to 1. This is obviously not good.
>

I think IndexFetchTableData seems like a not entirely terrible place
for allocating the pgsr, but I wonder what Andres thinks about this.
IIRC he advocated for doing the prefetching in the executor, and I'm
not sure heapam_handler.c + relscan.h is what he imagined ...

Also, when you say "obviously not good" - why? Are you concerned about
the extra overhead of shuffling stuff between queues, or something else?

> Right now, I allocate the TID queue and streaming read objects in
> IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in
> index_beginscan() (and index_beginscan_parallel()) is tricky though
> because we don't know the scan direction at that point (and the scan
> direction can change). There are also callers of index_beginscan() who
> do not call Index[Only]Next() (like systable_getnext() which calls
> index_getnext_slot() directly).
>

Yeah, not sure this is the right layering ... the initial patch did
everything in individual index AMs, then it moved to indexam.c, then to
executor. And this seems to move it to lower layers again ...

> Also, my implementation does not yet have the optimization Tomas does
> to skip prefetching recently prefetched blocks. As he has said, it
> probably makes sense to add something to do this in a lower layer --
> such as in the streaming read API or even in bufmgr.c (maybe in
> PrefetchSharedBuffer()).
>

I agree this should happen in a lower layer. I'd probably do it in the
streaming read API, because that would define the "scope" of the cache
(pages prefetched by that particular stream). Doing it in
PrefetchSharedBuffer would instead mean a single cache for the whole
backend.

But that's just an initial thought ...
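
For the record, the cache in my earlier patch was roughly of this
shape, just bigger and partitioned - so take this as a much simplified
sketch (sizes and names arbitrary):

#define PREFETCH_CACHE_SIZE 64

typedef struct PrefetchCacheEntry
{
    BlockNumber block;
    uint64      counter;    /* logical clock of the prefetch */
} PrefetchCacheEntry;

typedef struct PrefetchCache
{
    uint64              counter;
    PrefetchCacheEntry  entries[PREFETCH_CACHE_SIZE];
} PrefetchCache;

/* returns true if the block was prefetched recently (and records it) */
static bool
prefetch_cache_check(PrefetchCache *cache, BlockNumber block)
{
    PrefetchCacheEntry *entry
        = &cache->entries[block % PREFETCH_CACHE_SIZE];

    if (entry->block == block &&
        entry->counter + PREFETCH_CACHE_SIZE > cache->counter)
        return true;    /* seen recently, skip the prefetch */

    entry->block = block;
    entry->counter = cache->counter++;
    return false;
}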

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
