Re: index prefetching

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2023-12-20 19:09:06
Message-ID: CA+Tgmob5Nh7sQO8GENzEDSjYJQuKCgfhWtwDLjMtdueFMCh05A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> Whatever the right abstraction is, it probably needs to do these VM
> checks only once.

Makes sense.

> Yeah, after you pointed out the "leaky" abstraction, I also started to
> think about customizing the behavior using a callback. Not sure what
> exactly you mean by "fully transparent" but as I explained above I think
> we need to allow passing some information between the prefetcher and the
> executor - for example results of the visibility map checks in IOS.

Agreed.

> I have imagined something like this:
>
> nodeIndexscan / index_getnext_slot()
> -> no callback, all TIDs are prefetched
>
> nodeIndexonlyscan / index_getnext_tid()
> -> callback checks VM for the TID, prefetches if not all-visible
> -> the VM check result is stored in the queue with the TID (but in an
> extensible way, so that other callbacks can store other stuff)
> -> index_getnext_tid() also returns this extra information
>
> So not that different from the WIP patch, but in a "generic" and
> extensible way. Instead of hard-coding the all-visible flag, there'd be
> some custom information. A bit like qsort_r() has a void* arg to
> pass custom context.
>
> Or if you envisioned something different, could you elaborate a bit?

I can't totally follow the sketch you give above, but I think we're
thinking along similar lines, at least.
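
To check that we mean the same thing, here is a rough sketch of the
kind of callback interface I'm imagining. All of the names here are
made up for illustration; none of this is from your patch:

/*
 * Hypothetical per-scan prefetch callback: decides whether a TID's
 * heap block should be prefetched, and can stash custom per-TID state
 * (such as the VM check result) to be handed back along with the TID
 * later, much as qsort_r() threads a void * through to its comparator.
 */
typedef bool (*PrefetchFilter) (Relation heapRel, ItemPointer tid,
                                void *arg, Datum *private);

/* Index-only scans would pass something like this: */
static bool
ios_prefetch_filter(Relation heapRel, ItemPointer tid,
                    void *arg, Datum *private)
{
    Buffer     *vmbuf = (Buffer *) arg;
    bool        all_visible;

    all_visible = VM_ALL_VISIBLE(heapRel,
                                 ItemPointerGetBlockNumber(tid),
                                 vmbuf);

    /* remember the result so index_getnext_tid() can return it */
    *private = BoolGetDatum(all_visible);

    /* only prefetch heap blocks we will actually have to read */
    return !all_visible;
}

Plain index scans would pass no callback and prefetch every TID.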

> I think if the code stays in indexam.c, it's sensible to keep the index_
> prefix, but then also have a more appropriate rest of the name. For
> example it might be index_prefetch_heap_pages() or something like that.

Yeah, that's not a bad idea.

> > index_prefetch_is_sequential() makes me really nervous
> > because it seems to depend an awful lot on whether the OS is doing
> > prefetching, and how the OS is doing prefetching, and I think those
> > might not be consistent across all systems and kernel versions.
>
> If the OS does not have read-ahead, or it's not configured properly,
> then the patch does not perform worse than what we have now. I'm far
> more concerned about the opposite issue, i.e. causing regressions with
> OS-level read-ahead. And the check handles that well, I think.

I'm just not sure how much I believe that it's going to work well
everywhere. I mean, I have no evidence that it doesn't, it just kind
of looks like guesswork to me. For instance, the behavior of the
algorithm depends heavily on PREFETCH_QUEUE_HISTORY and
PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. Who is
to say that on some system or workload you didn't test the required
values aren't entirely different, or that the whole algorithm doesn't
need rethinking? Maybe we can't really answer that question perfectly,
but the patch doesn't really explain the reasoning behind this choice
of algorithm.
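
To be concrete about what worries me, my possibly-wrong reading of the
heuristic is roughly this (my paraphrase, not the patch's actual code,
and the constants are exactly the magic numbers in question):

#define PREFETCH_QUEUE_HISTORY          8
#define PREFETCH_SEQ_PATTERN_BLOCKS     4

static bool
prefetch_is_sequential(BlockNumber *history, int nhistory,
                       BlockNumber block)
{
    /* not enough history to judge: assume not sequential */
    if (nhistory < PREFETCH_SEQ_PATTERN_BLOCKS)
        return false;

    /*
     * If the last few requests form a consecutive run ending at this
     * block, assume OS read-ahead will get there first and skip the
     * explicit posix_fadvise().
     */
    for (int i = 1; i <= PREFETCH_SEQ_PATTERN_BLOCKS; i++)
    {
        if (history[nhistory - i] != block - i)
            return false;
    }

    return true;
}

Whether a run of 4 consecutive blocks is really the point at which
kernel read-ahead takes over seems like exactly the sort of thing that
varies across systems and kernel versions.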

> > Similarly with index_prefetch(). There's a lot of "magical"
> > assumptions here. Even index_prefetch_add_cache() has this problem --
> > the function assumes that it's OK if we sometimes fail to detect a
> > duplicate prefetch request, which makes sense, but under what
> > circumstances is it necessary to detect duplicates and in what cases
> > is it optional? The function comments are silent about that, which
> > makes it hard to assess whether the algorithm is good enough.
>
> I don't quite understand what problem with duplicates you envision here.
> Strictly speaking, we don't need to detect/prevent duplicates - it's
> just that if you do posix_fadvise() for a block that's already in
> memory, it's overhead / wasted time. The whole point is to not do that
> very often. In this sense it's entirely optional, but desirable.

Right ... but the patch sets up some data structure that will
eliminate duplicates in some circumstances and fail to eliminate them
in others. So it's making a judgement that the things it catches are
the cases that are important enough that we need to catch them, and
the things that it doesn't catch are cases that aren't particularly
important to catch. Here again, PREFETCH_LRU_SIZE and
PREFETCH_LRU_COUNT seem like they will have a big impact, but why
these values? The comments suggest that it's because we want to cover
~8MB of data, but it's not clear why that should be the right amount
of data to cover. My naive thought is that we'd want to avoid
prefetching a block again during the window between when we first
prefetched it and when we later read it, but if that's the goal, the
magic 8MB value here should really be derived from the operative
prefetch distance.
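
Just to make my reading concrete, here's roughly how I understand the
duplicate filter to work (again my paraphrase, with the constants under
discussion; not the patch's actual code):

#define PREFETCH_LRU_SIZE   8       /* entries per LRU slot */
#define PREFETCH_LRU_COUNT  128     /* number of LRU slots */

typedef struct PrefetchCache
{
    /* 128 slots * 8 entries * 8kB blocks covers ~8MB of requests */
    BlockNumber blocks[PREFETCH_LRU_COUNT][PREFETCH_LRU_SIZE];
    uint64      lru[PREFETCH_LRU_COUNT][PREFETCH_LRU_SIZE];
    uint64      clock;
} PrefetchCache;

/*
 * Returns true if the block was prefetched recently enough to still be
 * remembered, so the caller can skip the posix_fadvise(); otherwise
 * records the block, evicting the least recently used entry in its
 * slot. Assumes blocks[] starts out filled with InvalidBlockNumber.
 */
static bool
prefetch_cache_lookup_add(PrefetchCache *cache, BlockNumber block)
{
    int         slot = block % PREFETCH_LRU_COUNT;
    int         victim = 0;

    for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
    {
        if (cache->blocks[slot][i] == block)
        {
            cache->lru[slot][i] = cache->clock++;   /* refresh recency */
            return true;
        }
        if (cache->lru[slot][i] < cache->lru[slot][victim])
            victim = i;
    }

    cache->blocks[slot][victim] = block;
    cache->lru[slot][victim] = cache->clock++;
    return false;
}

A hit here means "requested within the last ~1024 prefetches", which is
only a proxy for "still in the kernel page cache"; whether that proxy
is good enough presumably depends on the workload.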

> I really don't want to have multiple knobs. At this point we have three
> GUCs, each tuning prefetching for a fairly large part of the system:
>
> effective_io_concurrency = regular queries
> maintenance_io_concurrency = utility commands
> recovery_prefetch = recovery / PITR
>
> This seems sensible, but I really don't want many more GUCs tuning
> prefetching for different executor nodes or something like that.
>
> If we have issues with how effective_io_concurrency works (and I'm not
> sure that's actually true), then perhaps we should fix that rather than
> inventing new GUCs.

Well, that would very possibly be a good idea, but I still think using
the same GUC for two different purposes is likely to cause trouble. I
think what effective_io_concurrency currently controls is basically
the heap prefetch distance for bitmap scans, and what you want to
control here is the heap prefetch distance for index scans. If those
are necessarily related in some understandable way (e.g. always the
same, one twice the other, one the square of the other) then it's fine
to use the same parameter for both, but it's not clear to me that this
is the case. I fear someone will find that if they crank up
effective_io_concurrency high enough to get the amount of prefetching
they want for bitmap scans, it will be too much for index scans, or
the other way around.
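
To put it concretely, if the two distances turn out not to be related
in any fixed way, I'd rather see something like this (the second GUC
name is invented purely for illustration):

    effective_io_concurrency = 16      # heap prefetch distance, bitmap scans
    index_scan_io_concurrency = 64     # hypothetical separate knob, index scans

than force one value to serve both purposes.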

--
Robert Haas
EDB: http://www.enterprisedb.com
