Re: index prefetching

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: index prefetching
Date: 2024-03-01 14:58:38
Message-ID: 8a5c7d0d-180f-4055-8d33-4312cfccfee0@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for looking at the patch!

On 3/1/24 09:20, Jakub Wartak wrote:
> On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> [
>>
>> (1) Melanie actually presented a very different way to implement this,
>> relying on the StreamingRead API. So chances are this struct won't
>> actually be used.
>
> Given lots of effort already spent on this and the fact that is thread
> is actually two:
>
> a. index/table prefetching since Jun 2023 till ~Jan 2024
> b. afterwards index/table prefetching with Streaming API, but there
> are some doubts of whether it could happen for v17 [1]
>
> ... it would be pitty to not take benefits of such work (even if
> Streaming API wouldn't be ready for this; although there's lots of
> movement in the area), so I've played a little with with the earlier
> implementation from [2] without streaming API as it already received
> feedback, it demonstrated big benefits, and earlier it got attention
> on pgcon unconference. Perhaps, some of those comment might be passed
> later to the "b"-patch (once that's feasible):
>

TBH I don't have a clear idea what to do. It'd be cool to have at least
some benefits in v17, but I don't know how to do that in a way that
would be useful in the future.

For example, the v20240124 patch implements this in the executor, but
based on the recent discussions it seems that's not the right layer -
the index AM needs to have some control, and I'm not convinced it's
possible to improve it in that direction (even ignoring the various
issues we identified in the executor-based approach).

I think it might be more practical to do this from the index AM, even if
it has various limitations. Ironically, that's what I proposed at pgcon,
but mostly because it was the quick&dirty way to do this.

> 1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does
> not apply cleanly anymore, due show_buffer_usage() being quite
> recently refactored in 5de890e3610d5a12cdaea36413d967cf5c544e20 :
>
> patching file src/backend/commands/explain.c
> Hunk #1 FAILED at 3568.
> Hunk #2 FAILED at 3679.
> 2 out of 2 hunks FAILED -- saving rejects to file
> src/backend/commands/explain.c.rej
>
> 2. v2 applies (fixup), but it would nice to see that integrated into
> main patch (it adds IndexOnlyPrefetchInfo) into one patch
>

Yeah, but I think it was an old patch version, no point in rebasing that
forever. Also, I'm not really convinced the executor-level approach is
the right path forward.

> 3. execMain.c :
>
> + * XXX It might be possible to improve the prefetching code
> to handle this
> + * by "walking back" the TID queue, but it's not clear if
> it's worth it.
>
> Shouldn't we just remove the XXX? The walking-back seems to be niche
> so are fetches using cursors when looking at real world users queries
> ? (support cases bias here when looking at peopel's pg_stat_activity)
>
> 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8,
> but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
> (allowing it to follow dynamically; the more prefetches the user wants
> to perform, the more you spread them across shared LRUs and the more
> memory for history is required?)
>
> + * XXX Maybe we could consider effective_cache_size when sizing the cache?
> + * Not to size the cache for that, ofc, but maybe as a guidance of how many
> + * heap pages it might keep. Maybe just a fraction fraction of the value,
> + * say Max(8MB, effective_cache_size / max_connections) or something.
> + */
> +#define PREFETCH_LRU_SIZE 8 /* slots in one LRU */
> +#define PREFETCH_LRU_COUNT 128 /* number of LRUs */
> +#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE *
> PREFETCH_LRU_COUNT)
>

I don't see why would this be related to effective_io_concurrency? It's
merely about how many recently accessed pages we expect to find in the
page cache. It's entirely separate from the prefetch distance.

> BTW:
> + * heap pages it might keep. Maybe just a fraction fraction of the value,
> that's a duplicated "fraction" word over there.
>
> 5.
> + * XXX Could it be harmful that we read the queue backwards?
> Maybe memory
> + * prefetching works better for the forward direction?
>
> I wouldn't care, we are optimizing I/O (and context-switching) which
> weighs much more than memory access direction impact and Dilipi
> earlier also expressed no concern, so maybe it could be also removed
> (one less "XXX" to care about)
>

Yeah, I think it's negligible. Probably a microoptimization we can
investigate later, I don't want to complicate the code unnecessarily.

> 6. in IndexPrefetchFillQueue()
>
> + while (!PREFETCH_QUEUE_FULL(prefetch))
> + {
> + IndexPrefetchEntry *entry
> + = prefetch->next_cb(scan, direction, prefetch->data);
>
> If we are at it... that's a strange split and assignment not indented :^)
>
> 7. in IndexPrefetchComputeTarget()
>
> + * XXX We cap the target to plan_rows, becausse it's pointless to prefetch
> + * more than we expect to use.
>
> That's a nice fact that's already in patch, so XXX isn't needed?
>

Right, which is why it's not a TODO/FIXME. But I think it's good to
point this out - I'm not 100% convinced we should be using plan_rows
like this (because what happens if the estimate happens to be wrong?).

> 8.
> + * XXX Maybe we should reduce the value with parallel workers?
>
> I was assuming it could be a good idea, but the same doesn't seem
> (eic/actual_parallel_works_per_gather) to be performed for bitmap heap
> scan prefetches, so no?
>

Yeah, if we don't do that now, I'm not sure this patch should change
that behavior.

> 9.
> + /*
> + * No prefetching for direct I/O.
> + *
> + * XXX Shouldn't we do prefetching even for direct I/O? We would only
> + * pretend doing it now, ofc, because we'd not do posix_fadvise(), but
> + * once the code starts loading into shared buffers, that'd work.
> + */
> + if ((io_direct_flags & IO_DIRECT_DATA) != 0)
> + return 0;
>
> It's redundant (?) and could be removed as
> PrefetchBuffer()->PrefetchSharedBuffer() already has this at line 571:
>
> 5 #ifdef USE_PREFETCH
> 4 │ │ /*
> 3 │ │ │* Try to initiate an asynchronous read. This
> returns false in
> 2 │ │ │* recovery if the relation file doesn't exist.
> 1 │ │ │*/
> 571 │ │ if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
> 1 │ │ │ smgrprefetch(smgr_reln, forkNum, blockNum, 1))
> 2 │ │ {
> 3 │ │ │ result.initiated_io = true;
> 4 │ │ }
> 5 #endif> > > > > > > /* USE_PREFETCH */
>

Yeah, I think it might be redundant. I think it allowed skipping a bunch
things without prefetching (like initialization of the prefetcher), but
after the reworks that's no longer true.

> 11. in IndexPrefetchStats() and ExecReScanIndexScan()
>
> + * FIXME Should be only in debug builds, or something like that.
>
> + /* XXX Print some debug stats. Should be removed. */
> + IndexPrefetchStats(indexScanDesc, node->iss_prefetch);
>
> Hmm, but it could be useful in tuning the real world systems, no? E.g.
> recovery prefetcher gives some info through pg_stat_recovery_prefetch
> view, but e.g. bitmap heap scans do not provide us with anything at
> all. I don't have a strong opinion. Exposing such stuff would take
> away your main doubt (XXX) from execPrefetch.c

You're right it'd be good to collect/expose such statistics, to help
with monitoring/tuning, etc. But I think there are better / more
convenient ways to do this - exposing that in EXPLAIN, and adding a
counter to pgstat_all_tables / pgstat_all_indexes.

> ``auto-tuning/self-adjustment". And if we are at it, we could think in
> far future about adding new session GUC track_cachestat or EXPLAIN
> (cachestat/prefetch, analyze) (this new syscall for Linux >= 6.5)
> where we could present both index stats (as what IndexPrefetchStats()
> does) *and* cachestat() results there for interested users. Of course
> it would have to be generic enough for the bitmap heap scan case too.
> Such insight would also allow fine tuning eic, PREFETCH_LRU_COUNT,
> PREFETCH_QUEUE_HISTORY. Just an idea.
>

I haven't really thought about this, but I agree some auto-tuning would
be very helpful (assuming it's sufficiently reliable).

> 12.
>
> + * XXX Maybe we should reduce the target in case this is
> a parallel index
> + * scan. We don't want to issue a multiple of
> effective_io_concurrency.
>
> in IndexOnlyPrefetchCleanup() and IndexNext()
>
> + * XXX Maybe we should reduce the value with parallel workers?
>
> It's redundant XXX-comment (there are two for the same), as you it was
> already there just before IndexPrefetchComputeTarget()
>
> 13. The previous bitmap prefetch code uses #ifdef USE_PREFETCH, maybe
> it would make some sense to follow the consistency pattern , to avoid
> adding implementation on platforms without prefetching ?
>

Perhaps, but I'm not sure how to do that with the executor-based
approach, where essentially everything goes through the prefetch queue
(except that the prefetch distance is 0). So the amount of code that
would be disabled by the ifdef would be tiny.

> 14. The patch is missing documentation, so how about just this?
>
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2527,7 +2527,8 @@ include_dir 'conf.d'
> operations that any individual
> <productname>PostgreSQL</productname> session
> attempts to initiate in parallel. The allowed range is 1 to 1000,
> or zero to disable issuance of asynchronous I/O requests. Currently,
> - this setting only affects bitmap heap scans.
> + this setting only enables prefetching for HEAP data blocks
> when performing
> + bitmap heap scans and index (only) scans.
> </para>
>
> Some further tests, given data:
>
> CREATE TABLE test (id bigint, val bigint, str text);
> ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
> INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int),
> 3000) FROM generate_series(1, 10000) g;
> -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 +
> (10*random())::int), 3000) from (select 10000 * random() as r from
> generate_series(1, 10000)) x;
> VACUUM ANALYZE test;
> CREATE INDEX on test (id) ;
>

It's not clear to me what's the purpose of this test? Can you explain?

> 1. the patch correctly detects sequential access (e.g. we issue up to
> 6 fadvise() syscalls (8kB each) out and 17 preads() to heap fd for
> query like `SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;`
> -- offset of fadvise calls and pread match), so that's good.
>
> 2. Prefetching for TOASTed heap seems to be not implemented at all,
> correct? (Is my assumption that we should go like this:
> t_index->t->toast_idx->toast_heap)?, but I'm too newbie to actually
> see the code path where it could be added - certainly it's not blocker
> -- but maybe in commit message a list of improvements for future could
> be listed?):
>

Yes, that's true. I haven't thought about TOAST very much, but with
prefetching happening in executor, that does not work. There'd need to
be some extra code for TOAST prefetching. I'm not sure how beneficial
that would be, considering most TOAST values tend to be stored on
consecutive heap pages.

> 2024-02-29 11:45:14.259 CET [11098] LOG: index prefetch stats:
> requests 1990 prefetches 17 (0.854271) skip cached 0 sequential 1973
> 2024-02-29 11:45:14.259 CET [11098] STATEMENT: SELECT
> md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;
>
> fadvise64(37, 40960, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(50, "\0\0\0\0\350Jv\1\0\0\4\0(\0\0\10\0 \4
> \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2998272) = 8192
> pread64(49, "\0\0\0\0(at)Hw\1\0\0\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237
> \0\320\237 \0"..., 8192, 40960) = 8192
> pread64(50, "\0\0\0\0\2200v\1\0\0\4\0(\0\0\10\0 \4
> \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2990080) = 8192
> pread64(50, "\0\0\0\08\26v\1\0\0\4\0(\0\0\10\0 \4
> \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2981888) = 8192
> pread64(50, "\0\0\0\0\340\373u\1\0\0\4\0(\0\0\10\0 \4
> \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2973696) = 8192
> [..no fadvises for fd=50 which was pg_toast_rel..]
>
> 3. I'm not sure if I got good-enough results for DESCending index
> `create index on test (id DESC);`- with eic=16 it doesnt seem to be
> be able prefetch 16 blocks in advance? (e.g. highlight offset 557056
> below in some text editor and it's distance is far lower between that
> fadvise<->pread):
>
> pread64(45, "\0\0\0\0x\305b\3\0\0\4\0\370\1\0\2\0 \4
> \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 0) = 8192
> fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(45, "\0\0\0\0\370\330\235\4\0\0\4\0\370\1\0\2\0 \4
> \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 417792) = 8192
> fadvise64(45, 671744, 8192, POSIX_FADV_WILLNEED) = 0
> fadvise64(45, 237568, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(45, "\0\0\0\08`]\5\0\0\4\0\370\1\0\2\0 \4
> \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 671744) = 8192
> fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0
> fadvise64(45, 360448, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(45, "\0\0\0\0\200\357\25\4\0\0\4\0\370\1\0\2\0 \4
> \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 237568) = 8192
> fadvise64(45, 557056, 8192, POSIX_FADV_WILLNEED) = 0
> fadvise64(45, 106496, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(45, "\0\0\0\0\240s\325\4\0\0\4\0\370\1\0\2\0 \4
> \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 491520) = 8192
> fadvise64(45, 401408, 8192, POSIX_FADV_WILLNEED) = 0
> fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(45, "\0\0\0\0\250\233r\4\0\0\4\0\370\1\0\2\0 \4
> \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 360448) = 8192
> fadvise64(45, 524288, 8192, POSIX_FADV_WILLNEED) = 0
> fadvise64(45, 352256, 8192, POSIX_FADV_WILLNEED) = 0
> pread64(45, "\0\0\0\0\240\342\6\5\0\0\4\0\370\1\0\2\0 \4
> \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 557056) = 8192
>

I'm not sure I understand these strace snippets. Can you elaborate a
bit, explain what the strace log says?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2024-03-01 15:18:54 Re: index prefetching
Previous Message Fabrice Chapuis 2024-03-01 14:46:45 double precisoin type