Re: PoC: prefetching index leaf pages (for inserts)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: PoC: prefetching index leaf pages (for inserts)
Date: 2023-11-23 14:46:31
Message-ID: 1ee3d0e5-da72-3eb0-a024-c3e862625b5e@enterprisedb.com
Lists: pgsql-hackers

On 11/23/23 14:26, Heikki Linnakangas wrote:
> On 06/11/2023 19:05, Tomas Vondra wrote:
>> As for the path forward, I think the prefetching is demonstrably
>> beneficial. There are cases where it can't help or even harms
>> performance. I think the success depends on three areas:
>>
>> (a) reducing the costs of the prefetching - For example right now we
>> build the index tuples twice (once for prefetch, once for the insert),
>> but maybe there's a way to do that only once? There are also predicate
>> indexes, and so on.
>>
>> (b) being smarter about when to prefetch - For example if we only have
>> one "prefetchable" index, it's somewhat pointless to prefetch (for
>> single-row cases). And so on.
>>
>> (c) not prefetching when already cached - This is somewhat related to
>> the previous case, but perhaps it'd be cheaper to first check if the
>> data is already cached. For shared buffers it should not be difficult,
>> for page cache we could use preadv2 with RWF_NOWAIT flag. The question
>> is if this is cheap enough to be cheaper than just doing posix_fadvise
>> (which however only deals with the page cache, not shared buffers).
>
> I don't like this approach. It duplicates the tree-descend code, and it
> also duplicates the work of descending the tree at runtime. And it only
> addresses index insertion; there are a lot of places that could benefit
> from prefetching or async execution like this.
>

Yeah, I think that's a fair assessment, although I think the amount of
duplicate code is pretty small (and perhaps it could be refactored into a
common function, which I chose not to do in the PoC patch).

> I think we should think of this as async execution rather than
> prefetching. We don't have the general infrastructure for writing async
> code, but if we did, this would be much simpler. In an async programming
> model, like you have in many other languages like Rust, python or
> javascript, there would be no separate prefetching function. Instead,
> aminsert() would return a future that can pause execution if it needs to
> do I/O. Something like this:
>
> aminsert_futures = NIL;
> /* create a future for each index insert */
> for (<all indexes>)
> {
>     aminsert_futures = lappend(aminsert_futures, aminsert(...));
> }
> /* wait for all the futures to finish */
> await aminsert_futures;
>
> The async-aware aminsert function would run to completion quickly if all
> the pages are already in cache. If you get a cache miss, it would start
> an async I/O read for the page, and yield to the other insertions until
> the I/O completes.
>
> We already support async execution of FDWs now, with the
> ForeignAsyncRequest() and ForeignAsyncConfigureWait() callbacks. Can we
> generalize that?
>

Interesting idea. I kinda like it in principle, however I'm not very
familiar with how our async execution works (and perhaps even with async
implementations in general), so I can't quite say how difficult it would
be to do something like that in an AM (instead of in the executor).

Where exactly would the boundary be between who "creates" and who
"executes" the requests, and what would the flow of execution look like?
For the FDW case it seems fairly straightforward, because the boundary is
local/remote, and the async request is executed by a separate process.

But here everything would happen locally, so how would that work?

Imagine we're inserting a tuple into two indexes. There's a bunch of
index pages we may need to read for each index - first some internal
pages, then a leaf page. Presumably we'd want to do each page read
asynchronously, allowing transfer of control to the other index.

IIUC the async-aware aminsert() would execute an async request for the
first page it needs to read, with a callback that essentially does the
next step of the index descent - it reads the page, determines the next
page to read, and issues another async request. Then it'd sleep, which
would allow transfer of control to the aminsert() on the other index. And
then we'd do a similar thing for the leaf pages.

Or do I imagine things wrong?

The thing I like about this async approach is that it would allow
prefetching all index pages, while my PoC patch simply assumes all
internal pages are in cache and prefetches only the first leaf page.
That's much simpler in terms of control flow, but has clear limits.

I however wonder if there are concurrency issues. Imagine there's a
COPY, i.e. we're inserting a batch of tuples. Can you run the aminsert()
for all the tuples concurrently? Won't the different "async threads" have
issues modifying the index under the feet of the other threads?

If those concurrent "insert threads" were an issue, maybe we could
make amprefetch() async-aware instead ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
