| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> | 
| Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com> | 
| Subject: | Re: index prefetching | 
| Date: | 2023-12-21 13:43:14 | 
| Message-ID: | 20231221134314.wf2rs62d37u62j7t@alap3.anarazel.de | 
| Lists: | pgsql-hackers | 
Hi,
On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote:
> You're right a lot of this is guesswork. I don't think we can do much
> better, because it depends on stuff that's out of our control - each OS
> may do things differently, or perhaps it's just configured differently.
> 
> But I don't think this is really a serious issue - all the read-ahead
> implementations need to work about the same, because they are meant to
> work in a transparent way.
> 
> So it's about deciding at which point we think this is a sequential
> pattern. Yes, the OS may use a slightly different threshold, but the
> exact value does not really matter - in the worst case we prefetch a
> couple more/fewer blocks.
> 
> The OS read-ahead can't really prefetch anything except sequential
> cases, so the whole question is "When does the access pattern get
> sequential enough?". I don't think there's a perfect answer, and I don't
> think we need a perfect one - we just need to be reasonably close.
For the streaming read interface (initially backed by fadvise, to then be
replaced by AIO) we found that it's clearly necessary to avoid fadvises in
cases of actual sequential IO - the overhead otherwise leads to easily
reproducible regressions.  So I don't think we have much choice.
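
To illustrate the idea of skipping fadvise once the pattern looks sequential,
here is a minimal Python sketch. It is not the streaming read code; the class
name, the `threshold` parameter and its default of 4 are assumptions made up
for the example, and the real heuristic may look quite different.

```python
import os


class SequentialDetector:
    """Hypothetical heuristic: stop hinting once requests look sequential."""

    def __init__(self, threshold=4):
        # Number of consecutive "next block" requests after which we assume
        # OS read-ahead takes over (assumed value, not taken from any patch).
        self.threshold = threshold
        self.last_block = None
        self.run_length = 0

    def should_prefetch(self, block):
        # Extend the run if this block directly follows the previous one,
        # otherwise start counting again.
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 0
        self.last_block = block
        return self.run_length < self.threshold


def prefetch_blocks(fd, blocks, block_size=8192, detector=None):
    """Issue POSIX_FADV_WILLNEED only for blocks outside a sequential run."""
    detector = detector or SequentialDetector()
    issued = []
    for block in blocks:
        if detector.should_prefetch(block):
            if hasattr(os, "posix_fadvise"):  # not available on every platform
                os.posix_fadvise(fd, block * block_size, block_size,
                                 os.POSIX_FADV_WILLNEED)
            issued.append(block)
    return issued
```

With `threshold=2`, the block sequence 10, 11, 12, 13, 50, 51 would get hints
for 10, 11, 50 and 51, and none for 12 and 13.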
> Also, while I don't want to lazily dismiss valid cases that might be
> affected by this, I think that sequential access for index paths is not
> that common (with the exception of clustered indexes).
I think sequential access is common in other cases as well. There are lots of
indexes where heap tids are almost perfectly correlated with index entries;
consider insert-only tables and serial PKs or inserted_at timestamp
columns.  Even leaving those aside, for indexes with many entries for the
same key, we sort by tid these days, which will also result in "runs" of
sequential access.
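
A small Python sketch of why tid ordering within a key produces such runs.
The tids below are invented for the example, and `sequential_runs` is a
hypothetical helper, not anything in the tree.

```python
def sequential_runs(blocks):
    """Split a block sequence into runs where each block repeats or follows
    the previous block by exactly one."""
    runs = []
    for block in blocks:
        if runs and block - runs[-1][-1] in (0, 1):
            runs[-1].append(block)
        else:
            runs.append([block])
    return runs


# Made-up heap tids (block, offset) for many index entries with the same key.
tids = [(7, 3), (5, 1), (6, 2), (5, 4), (6, 1), (9, 1)]

# Heap blocks visited in arbitrary order vs. in tid order.
unsorted_blocks = [block for block, _ in tids]
sorted_blocks = [block for block, _ in sorted(tids)]
```

Here `sequential_runs(unsorted_blocks)` gives four short runs, while
`sequential_runs(sorted_blocks)` gives `[[5, 5, 6, 6, 7], [9]]`, i.e. one long
run that OS read-ahead could pick up.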
> Obviously, the latter case has much more severe impact, but it depends
> on the exact workload / access pattern etc. The only "perfect" solution
> would be to actually check the page cache, but well - that seems to be
> fairly expensive.
> 
> What I was envisioning was something self-tuning, based on the I/O we
> may do later. If the prefetcher decides to prefetch something, but finds
> it's already in cache, we'd increase the distance, to remember more
> blocks. Likewise, if a block is not prefetched but then requires I/O
> later, decrease the distance. That'd make it adaptive, but I don't think
> we actually have the info about I/O.
How would the prefetcher know that the data wasn't in cache?
> Alternatively, I was thinking about moving the prefetches into a
> separate worker process (or multiple workers), so we'd just queue the
> request and all the overhead would be done by the worker. The main
> problem is the overhead of calling posix_fadvise() for blocks that are
> already in memory, and this would just move it to a separate backend. I
> wonder if that might even make the custom cache unnecessary / optional.
The AIO patchset provides this.
> AFAICS this seems similar to some of the AIO patch, I wonder what that
> plans to do. I need to check.
Yes, most of this exists there.  The difference is that with AIO you don't
need to prefetch, as you can just initiate the IO for real and wait for it to
complete.
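
As a rough illustration of "initiate the IO for real and wait for it", here is
a Python sketch that keeps a few reads in flight and consumes them in order. A
thread pool stands in for asynchronous IO here; this is not the AIO patchset's
API, and `read_blocks_async`, `distance` and `demo` are names invented for the
example.

```python
import os
import tempfile
from collections import deque
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8192


def read_blocks_async(fd, blocks, distance=4):
    """Yield (block, data) in request order, keeping up to `distance`
    reads in flight instead of hinting first and reading later."""
    with ThreadPoolExecutor(max_workers=distance) as pool:
        in_flight = deque()
        for block in blocks:
            # Start the actual read now; no separate prefetch hint is issued.
            future = pool.submit(os.pread, fd, BLOCK_SIZE, block * BLOCK_SIZE)
            in_flight.append((block, future))
            if len(in_flight) >= distance:
                done_block, done_future = in_flight.popleft()
                yield done_block, done_future.result()  # wait for completion
        while in_flight:
            done_block, done_future = in_flight.popleft()
            yield done_block, done_future.result()


def demo():
    """Write 8 blocks, each filled with its own block number, then read
    a few of them back in a non-sequential order."""
    with tempfile.TemporaryFile() as f:
        for i in range(8):
            f.write(bytes([i]) * BLOCK_SIZE)
        f.flush()
        order = [5, 1, 6, 2]
        return [(block, data[0])
                for block, data in read_blocks_async(f.fileno(), order)]
```

The caller never asks "is this block cached?"; it starts the read and only
blocks when it actually needs the data.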
Greetings,
Andres Freund