Re: Custom explain options

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Custom explain options
Date: 2024-01-15 15:08:05
Message-ID: e654159f-2033-4333-9404-c9bb898a1866@enterprisedb.com
Lists: pgsql-hackers

On 1/15/24 15:22, Konstantin Knizhnik wrote:
>
> On 14/01/2024 11:47 pm, Tomas Vondra wrote:
>> The thing that was not clear to me is who decides what to prefetch,
>> which code issues the prefetch requests etc. In the github links you
>> shared I see it happens in the index AM code (in nbtsearch.c).
>
>
> It is up to the particular plan node (seqscan, indexscan,...) which
> pages to prefetch.
>
>
>>
>> That's interesting, because that's what my first prefetching patch did
>> too - not the same way, ofc, but in the same layer. Simply because it
>> seemed like the simplest way to do that. But the feedback was that's the
>> wrong layer, and that it should happen in the executor. And I agree with
>> that - the reasons are somewhere in the other thread.
>>
> I read the arguments in
>
> https://www.postgresql.org/message-id/flat/8c86c3a6-074e-6c88-3e7e-9452b6a37b9b%40enterprisedb.com#fc792f8d013215ace7971535a5744c83
>
> Separating the prefetch info in the index scan descriptor is a really
> good idea. It would be amazing to have a generic prefetch mechanism for
> all indexes. But unfortunately I do not understand how that is possible.
> The logic of index traversal is implemented inside the AM; the executor
> doesn't know it. For example, for a B-Tree scan we can prefetch:
>
> - intermediate pages
> - leaf pages
> - heap pages referenced by TIDs
>

My patch does not care about prefetching internal index pages. Yes, it's
a limitation, but my assumption is the internal pages are maybe 0.1% of
the index, and typically very hot / cached. Yes, if the index is not
used very often, this may be untrue. But I consider it a possible future
improvement, for some other patch. FWIW there's a prefetching patch for
inserts into indexes (which prefetches only the index leaf pages).

> Before we load the next intermediate page, we do not know the next leaf
> pages. And before we load the next leaf page, we cannot find out the
> TIDs from that page.
>

Not sure I understand what this is about. The patch simply calls the
index AM function index_getnext_tid() enough times to fill the prefetch
queue. It does not prefetch the next index leaf page, but it does
prefetch the heap pages. And it does not "stall" at the boundary of an
index leaf page.
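
Roughly, the executor-side logic looks something like this (a simplified
sketch, not the actual patch code - the PrefetchQueue struct and the
function name are made up for illustration):

    /*
     * Simplified sketch: keep reading TIDs ahead of the scan position
     * and prefetch the heap pages they reference. PrefetchQueue is a
     * made-up struct holding the prefetched TIDs and the current
     * prefetch distance.
     */
    static void
    prefetch_fill_queue(IndexScanDesc scan, Relation heapRel,
                        PrefetchQueue *queue, ScanDirection dir)
    {
        while (queue->used < queue->distance)
        {
            ItemPointer tid = index_getnext_tid(scan, dir);

            if (tid == NULL)
                break;          /* no more entries in the index */

            /* issue a prefetch for the heap page the TID points to */
            PrefetchBuffer(heapRel, MAIN_FORKNUM,
                           ItemPointerGetBlockNumber(tid));

            queue->tids[queue->used++] = *tid;
        }
    }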

> Another challenge is how far ahead we should prefetch (as far as I
> understand, both your approach and ours use a dynamically extended
> prefetch window).
>

By dynamic extension of the prefetch window you mean the incremental growth
of the prefetch distance from 0 to effective_io_concurrency? I don't
think there's a better solution.
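
The ramp-up itself is pretty simple - roughly this (simplified, the
queue fields are made up):

    /* grow the prefetch distance exponentially, capped by the
     * effective_io_concurrency GUC, so that short OLTP queries don't
     * pay for prefetching they would never benefit from */
    if (queue->distance == 0)
        queue->distance = 1;
    else
        queue->distance = Min(queue->distance * 2,
                              effective_io_concurrency);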

There might be additional information that we could consider (e.g.
expected number of rows for the plan, earlier executions of the scan,
...) but each of these has a failure mode.

>> Based on what I saw in the neon code, I think it should be possible for
>> neon to use "my" approach too, but that only works for the index scans,
>> ofc. Not sure what to do about the other places.
> We definitely need prefetch for heap scan (it gives the biggest
> performance advantage), for vacuum and also for pg_prewarm. I also
> tried to implement it for custom indexes such as pgvector. I am still
> not sure whether it is possible to create some generic solution which
> will work for all indexes.
>

I haven't tried with pgvector, but I don't see why my patch would not
work for all index AMs that can return TIDs.

> I have also tried to implement an alternative approach to prefetch,
> based on access statistics.
> It comes from the use case of a seqscan of a table with large toasted
> records, where for each record we have to fetch its TOAST data.
> That is done using a standard index scan, but unfortunately index
> prefetch doesn't help much here: there is usually just one TOAST
> segment, so the prefetch has no chance to do anything useful. But as
> long as the heap records are accessed sequentially, there is a good
> chance that the TOAST table will also be accessed mostly sequentially.
> So we can simply count the number of sequential requests to each
> relation, and if the ratio of seq/rand accesses is above some
> threshold, we can prefetch the next pages of this relation. This is a
> really universal approach but ... it works mostly for the TOAST table.
>

Are you talking about what works / doesn't work in neon, or about
postgres in general?

I'm not sure what you mean by "one TOAST segment" and I'd also guess
that if both tables are accessed mostly sequentially, the read-ahead
will do most of the work (in postgres).

It's probably true that as we do a separate index scan for each TOAST-ed
value, we can't really ramp up the prefetch distance fast enough.
Maybe we could have a mode where we start with the full distance?
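
FWIW if I understand the heuristic you describe, it's roughly this (a
hypothetical sketch, all the names are made up):

    /* count sequential vs. random requests per relation, and prefetch
     * ahead once sequential accesses dominate */
    AccessStats *stat = &relation_access_stats[rel_id];

    if (blkno == stat->last_blkno + 1)
        stat->seq_count++;
    else
        stat->rand_count++;
    stat->last_blkno = blkno;

    if (stat->seq_count > SEQ_RAND_RATIO * stat->rand_count)
    {
        /* looks sequential - prefetch the next few blocks */
        for (int i = 1; i <= PREFETCH_AHEAD; i++)
            PrefetchBuffer(rel, MAIN_FORKNUM, blkno + i);
    }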

>
>>> As I already wrote - prefetch is done locally for each backend, and
>>> each backend has its own connection to the page server. That may
>>> change in the future when we implement multiplexing of page server
>>> connections, but right now prefetch is local. And certainly prefetch
>>> can improve performance only if we correctly predict subsequent page
>>> requests. If not, then the page server does useless work and the
>>> backend has to wait and consume all the issued prefetch requests.
>>> This is why in the prefetch implementation for most nodes we start
>>> with a minimal prefetch distance and then increase it. That allows
>>> prefetching only for queries where it is really efficient (OLAP) and
>>> doesn't degrade the performance of simple OLTP queries.
>>>
>> Not sure I understand what's so important about prefetches being "local"
>> for each backend. I mean even in postgres each backend prefetches its
>> own buffers, no matter what the other backends do. Although, neon
>> probably doesn't have the cross-backend sharing through shared buffers
>> etc. right?
>
>
> Sorry if my explanation was not clear:(
>
>> I mean even in postgres each backend prefetches its own buffers, no
>> matter what the other backends do.
>
> This is exactly the difference. In Neon such an approach doesn't work.
> Each backend maintains its own prefetch ring, and if a prefetched page
> is not the one actually read next, the whole pipeline is lost.
> I.e. a backend prefetched pages 1,5,10, and then it needs to read page
> 2. So it has to consume the responses for 1,5,10 and issue another
> request for page 2. Instead of improving speed we are just doing extra
> work.
> So each backend should prefetch only those pages which it is actually
> going to read.
> This is why the prefetch approach used in Postgres, for example for
> parallel bitmap heap scan, doesn't work for Neon.
> If you do `posix_fadvise` then the prefetched page is placed in the OS
> cache and can be used by any parallel worker.
> But in Neon each parallel worker should be given its own range of pages
> to scan, and prefetch only those pages.
>

I still don't quite see/understand the difference. I mean, even in
postgres each backend does its own prefetches, using its own prefetch
ring. But I'm not entirely sure about the neon architecture differences.

Does this mean neon can do prefetching from the executor in principle?

Could you perhaps describe a situation where the bitmap scan prefetching
(as implemented in Postgres) does not work for neon?

>>
>>> Well, my assumption was the following: prefetch is most efficient
>>> for OLAP queries.
>>> Although HTAP (hybrid transactional/analytical processing) is a
>>> popular trend now, the classical model is that analytic queries are
>>> performed on "historical" data, which was already processed by
>>> vacuum and has all-visible bits set in the VM.
>>> Maybe this assumption is wrong, but it seems to me that if most heap
>>> pages are not marked as all-visible, then the optimizer should
>>> prefer a bitmap scan to an index-only scan.
>> I think this assumption is generally reasonable, but it hinges on the
>> assumption that for OLAP queries most tables were recently vacuumed
>> and are all-visible. I'm not sure it's wise to rely on that.
>>
>> Without prefetching it's not that important - the worst thing that would
>> happen is that the IOS degrades into regular index-scan.
>>
> I think that it is also a problem without prefetch. There are cases
> where a seqscan or bitmap heap scan is really much faster than an IOS,
> because the latter has to perform a lot of visibility checks. Yes,
> certainly the optimizer takes into account the percentage of
> all-visible pages. But it is not trivial to adjust the optimizer
> parameters so that it really chooses the fastest plan.

True. There are more cases where it can happen, no doubt about it. But I
think those cases are somewhat less likely.

>> But with prefetching these plans can "invert" with respect to cost.
>>
>> I'm not saying it's terrible or that IOS must have prefetching, but I
>> think it's something users may run into fairly often. And it led me to
>> rework the prefetching so that IOS can prefetch too ...
>>
>>
>
> I think that inspecting the VM for prefetch is a really good idea.
>
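
Right - the idea is to check the visibility map and prefetch only heap
pages that are not all-visible, because those are the only pages the
IOS actually reads. A simplified sketch (not the actual patch code):

    BlockNumber blkno = ItemPointerGetBlockNumber(tid);

    /* an index-only scan won't read all-visible heap pages at all,
     * so prefetching them would be wasted work */
    if (!VM_ALL_VISIBLE(heapRel, blkno, &vmbuffer))
        PrefetchBuffer(heapRel, MAIN_FORKNUM, blkno);
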
>> Thanks! Very helpful. As I said, I ended up moving the prefetching to
>> the executor. For indexscans I think it should be possible for neon to
>> benefit from that (in a way, it doesn't need to do anything except for
>> overriding what PrefetchBuffer does). Not sure about the other places
>> where neon needs to prefetch; I don't have the ambition to rework those.
>>
> Once your PR is merged, I will rewrite the Neon prefetch implementation
> for indexes using your approach.
>

Well, maybe you could try rewriting it now, so that you can give some
feedback on the patch. I'd appreciate that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
