
From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2023-12-21 12:30:42
Message-ID: 98ba4b25-fae8-c1f4-1597-8093375a1986@enterprisedb.com

On 12/20/23 20:09, Robert Haas wrote:
> On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra
> ...
>> I have imagined something like this:
>>
>> nodeIndexscan / index_getnext_slot()
>> -> no callback, all TIDs are prefetched
>>
>> nodeIndexonlyscan / index_getnext_tid()
>> -> callback checks VM for the TID, prefetches if not all-visible
>> -> the VM check result is stored in the queue with the VM (but in an
>> extensible way, so that other callback can store other stuff)
>> -> index_getnext_tid() also returns this extra information
>>
>> So not that different from the WIP patch, but in a "generic" and
>> extensible way. Instead of hard-coding the all-visible flag, there'd be
>> some custom information. A bit like qsort_r() has a void* arg to
>> pass custom context.
>>
>> Or if you envisioned something different, could you elaborate a bit?
>
> I can't totally follow the sketch you give above, but I think we're
> thinking along similar lines, at least.
>

Yeah, it's hard to discuss vague descriptions of code that doesn't exist
yet. I'll try to write the actual patch, and then we can discuss.
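
In the meantime, to make the sketch a bit more concrete, the interface
I have in mind is roughly this (all the names are made up, just to
illustrate the shape, not actual code):

/* hypothetical, just to illustrate the extensible-callback idea */
typedef struct IndexPrefetchEntry
{
    ItemPointerData tid;
    void           *private_data;   /* per-callback info, e.g. the VM bit */
} IndexPrefetchEntry;

/*
 * Decide whether to prefetch the heap page for this TID, and optionally
 * stash custom information (like the VM check result) into the queue
 * entry. The void *arg is the caller's context, like qsort_r().
 */
typedef bool (*IndexPrefetchCallback) (IndexScanDesc scan,
                                       IndexPrefetchEntry *entry,
                                       void *arg);

Plain index scans would pass no callback (prefetch everything),
index-only scans would pass a callback doing the VM check, and
index_getnext_tid() would hand the stored result back to the caller.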

>>> index_prefetch_is_sequential() makes me really nervous
>>> because it seems to depend an awful lot on whether the OS is doing
>>> prefetching, and how the OS is doing prefetching, and I think those
>>> might not be consistent across all systems and kernel versions.
>>
>> If the OS does not have read-ahead, or it's not configured properly,
>> then the patch does not perform worse than what we have now. I'm far
>> more concerned about the opposite issue, i.e. causing regressions with
>> OS-level read-ahead. And the check handles that well, I think.
>
> I'm just not sure how much I believe that it's going to work well
> everywhere. I mean, I have no evidence that it doesn't, it just kind
> of looks like guesswork to me. For instance, the behavior of the
> algorithm depends heavily on PREFETCH_QUEUE_HISTORY and
> PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. Who is
> to say that on some system or workload you didn't test the required
> values aren't entirely different, or that the whole algorithm doesn't
> need rethinking? Maybe we can't really answer that question perfectly,
> but the patch doesn't really explain the reasoning behind this choice
> of algorithm.
>

You're right that a lot of this is guesswork. I don't think we can do much
better, because it depends on stuff that's out of our control - each OS
may do things differently, or perhaps it's just configured differently.

But I don't think this is really a serious issue - all the read-ahead
implementations need to work about the same, because they are meant to
work in a transparent way.

So it's about deciding at which point we think this is a sequential
pattern. Yes, the OS may use a slightly different threshold, but the
exact value does not really matter - in the worst case we prefetch a
couple more/fewer blocks.

The OS read-ahead can't really prefetch anything except sequential
cases, so the whole question is "When does the access pattern get
sequential enough?". I don't think there's a perfect answer, and I don't
think we need a perfect one - we just need to be reasonably close.
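
For the record, the check is pretty much just this (a simplified sketch
of what index_prefetch_is_sequential() does, with the struct fields
renamed):

/* simplified sketch of the sequential-pattern check */
static bool
prefetch_is_sequential(IndexPrefetch *p, BlockNumber block)
{
    /* not enough history yet - assume it's not sequential */
    if (p->queueIndex < PREFETCH_SEQ_PATTERN_BLOCKS)
        return false;

    /* did the last few requests form a consecutive run ending at 'block'? */
    for (int i = 1; i <= PREFETCH_SEQ_PATTERN_BLOCKS; i++)
    {
        uint64  idx = (p->queueIndex - i) % PREFETCH_QUEUE_HISTORY;

        if (p->blockHistory[idx] != block - i)
            return false;
    }

    /* looks sequential enough - skip the fadvise, let OS read-ahead do it */
    return true;
}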

Also, while I don't want to lazily dismiss valid cases that might be
affected by this, I think that sequential access for index paths is not
that common (with the exception of clustered indexes).

FWIW bitmap index scans have exactly the same "problem" except that no
one cares about it because that's how it worked from the start, so it's
not considered a regression.

>>> Similarly with index_prefetch(). There's a lot of "magical"
>>> assumptions here. Even index_prefetch_add_cache() has this problem --
>>> the function assumes that it's OK if we sometimes fail to detect a
>>> duplicate prefetch request, which makes sense, but under what
>>> circumstances is it necessary to detect duplicates and in what cases
>>> is it optional? The function comments are silent about that, which
>>> makes it hard to assess whether the algorithm is good enough.
>>
>> I don't quite understand what problem with duplicates you envision here.
>> Strictly speaking, we don't need to detect/prevent duplicates - it's
>> just that if you do posix_fadvise() for a block that's already in
>> memory, it's overhead / wasted time. The whole point is to not do that
>> very often. In this sense it's entirely optional, but desirable.
>
> Right ... but the patch sets up some data structure that will
> eliminate duplicates in some circumstances and fail to eliminate them
> in others. So it's making a judgement that the things it catches are
> the cases that are important enough that we need to catch them, and
> the things that it doesn't catch are cases that aren't particularly
> important to catch. Here again, PREFETCH_LRU_SIZE and
> PREFETCH_LRU_COUNT seem like they will have a big impact, but why
> these values? The comments suggest that it's because we want to cover
> ~8MB of data, but it's not clear why that should be the right amount
> of data to cover. My naive thought is that we'd want to avoid
> prefetching a block during the time between we had prefetched it and
> when we later read it, but then the value that is here magically 8MB
> should really be replaced by the operative prefetch distance.
>

True. Ideally we'd not issue prefetch requests for data that's already in
memory - either in shared buffers or page cache (or whatever). And we
already do that for shared buffers, but not for page cache. The preadv2
experiment was an attempt to do that, but it's too expensive to help.
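
For the record, the preadv2 experiment was essentially this - probe the
page cache with RWF_NOWAIT, and only fadvise blocks that would require
actual I/O (the syscall and the flag are the real Linux ones, the helper
itself is a simplification):

/* needs _GNU_SOURCE and <sys/uio.h>; RWF_NOWAIT requires Linux 4.14+ */
static bool
block_in_page_cache(int fd, off_t offset)
{
    char         buf[8192];     /* one BLCKSZ-sized block */
    struct iovec iov = {.iov_base = buf, .iov_len = sizeof(buf)};

    /* with RWF_NOWAIT the read fails with EAGAIN instead of doing I/O */
    return preadv2(fd, &iov, 1, offset, RWF_NOWAIT) >= 0;
}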

So we have to approximate, and the only way I can think of is checking
if we recently prefetched that block. Which is the whole point of this
simple cache - remembering which blocks we prefetched, so that we don't
prefetch them over and over again.
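
Roughly, what the cache does is this (a simplified sketch of the
index_prefetch_add_cache() logic, not the exact code - the field names
are made up):

/* entry in the "recently prefetched" cache */
typedef struct PrefetchCacheEntry
{
    BlockNumber block;      /* initialized to InvalidBlockNumber */
    uint64      request;    /* counter value when we last prefetched it */
} PrefetchCacheEntry;

static bool
prefetch_cache_contains(IndexPrefetch *p, BlockNumber block)
{
    /* hash the block into one of PREFETCH_LRU_COUNT small LRUs */
    PrefetchCacheEntry *lru
        = &p->cache[(block % PREFETCH_LRU_COUNT) * PREFETCH_LRU_SIZE];
    int     oldest = 0;

    for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
    {
        /* recently prefetched => no need to fadvise again */
        if (lru[i].block == block)
        {
            lru[i].request = ++p->requestCounter;
            return true;
        }

        if (lru[i].request < lru[oldest].request)
            oldest = i;
    }

    /* not found: evict the oldest entry in this LRU, remember the block */
    lru[oldest].block = block;
    lru[oldest].request = ++p->requestCounter;

    return false;       /* not seen recently => issue the prefetch */
}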

I don't understand what you mean by "cases that are important enough".
In a way, all the blocks are equally important, with exactly the same
impact of making the wrong decision.

You're certainly right that the 8MB is a pretty arbitrary value, though. It
seemed reasonable, so I used that, but I might just as well use 32MB or
some other sensible value. Ultimately, any hard-coded value is going to
be wrong, but the negative consequences are a bit asymmetrical. If the
cache is too small, we may end up issuing prefetches for data that's
already in cache. If it's too large, we may skip prefetching data that's
no longer in memory by the time we actually read it.

Obviously, the latter case has a much more severe impact, but it depends
on the exact workload / access pattern etc. The only "perfect" solution
would be to actually check the page cache, but well - that seems to be
fairly expensive.

What I was envisioning was something self-tuning, based on the I/O we
may do later. If the prefetcher decides to prefetch something, but finds
it's already in cache, we'd increase the distance, to remember more
blocks. Likewise, if a block is not prefetched but then requires I/O
later, decrease the distance. That'd make it adaptive, but I don't think
we actually have the info about I/O.
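
In pseudo-code, something like this (entirely hypothetical - none of
these variables exist, and we don't currently get this feedback from the
I/O layer):

/* hypothetical feedback loop for sizing the prefetch cache */
if (prefetch_hit_page_cache)            /* the fadvise was wasted */
    p->cacheBlocks = Min(p->cacheBlocks * 2, MAX_CACHE_BLOCKS);
else if (skipped_prefetch_needed_io)    /* we remembered the block too long */
    p->cacheBlocks = Max(p->cacheBlocks / 2, MIN_CACHE_BLOCKS);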

A bigger "flaw" is that these caches are per-backend, so there's no way
to check if a block was recently prefetched by some other backend. I
actually wonder if maybe this cache should be in shared memory, but I
haven't tried.

Alternatively, I was thinking about moving the prefetches into a
separate worker process (or multiple workers), so we'd just queue the
request and all the overhead would be done by the worker. The main
problem is the overhead of calling posix_fadvise() for blocks that are
already in memory, and this would just move it to a separate backend. I
wonder if that might even make the custom cache unnecessary / optional.
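
I imagine the worker would be doing little more than this (all the names
are made up - there's no such queue or worker at the moment):

/* hypothetical prefetch worker loop */
for (;;)
{
    PrefetchRequest req = prefetch_queue_pop();     /* shared-memory queue */
    Relation        rel = relation_open(req.relid, AccessShareLock);

    /* the posix_fadvise() cost for already-cached blocks is paid here,
     * not in the backend executing the index scan */
    PrefetchBuffer(rel, MAIN_FORKNUM, req.block);

    relation_close(rel, AccessShareLock);
}

(In practice the worker would cache the opened relations etc., but that's
the gist of it.)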

AFAICS this seems similar to parts of the AIO patch; I wonder what that
plans to do here. I need to check.

>> I really don't want to have multiple knobs. At this point we have three
>> GUCs, each tuning prefetching for a fairly large part of the system:
>>
>> effective_io_concurrency = regular queries
>> maintenance_io_concurrency = utility commands
>> recovery_prefetch = recovery / PITR
>>
>> This seems sensible, but I really don't want many more GUCs tuning
>> prefetching for different executor nodes or something like that.
>>
>> If we have issues with how effective_io_concurrency works (and I'm not
>> sure that's actually true), then perhaps we should fix that rather than
>> inventing new GUCs.
>
> Well, that would very possibly be a good idea, but I still think using
> the same GUC for two different purposes is likely to cause trouble. I
> think what effective_io_concurrency currently controls is basically
> the heap prefetch distance for bitmap scans, and what you want to
> control here is the heap prefetch distance for index scans. If those
> are necessarily related in some understandable way (e.g. always the
> same, one twice the other, one the square of the other) then it's fine
> to use the same parameter for both, but it's not clear to me that this
> is the case. I fear someone will find that if they crank up
> effective_io_concurrency high enough to get the amount of prefetching
> they want for bitmap scans, it will be too much for index scans, or
> the other way around.
>

I understand, but I think we should really try to keep the number of
knobs as low as possible, unless we actually have very good arguments
for having separate GUCs. And I don't think we have that.

This is very much about how many concurrent requests the storage can
handle (or rather requires in order to benefit from its capabilities),
and that's pretty orthogonal to which operation is generating the requests.

I think this is pretty similar to what we do with work_mem - there's one
value for all possible parts of the query plan, no matter if it's a sort,
a group by, or something else. We do have separate limits for maintenance
commands, because that's a different matter, and we have the same for
the two I/O GUCs.
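
In other words, users would keep tuning just the knobs we already have,
e.g.:

# postgresql.conf (example values, not recommendations)
effective_io_concurrency = 200       # regular queries (bitmap heap scans,
                                     # and index scans with this patch)
maintenance_io_concurrency = 100     # utility commands
recovery_prefetch = try              # recovery / PITR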

If we come to the realization that we really need two GUCs, that's fine with me.
But at this point I don't see a reason to do that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
