Re: index prefetching

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: index prefetching
Date: 2024-03-05 13:00:12
Message-ID: CAKZiRmzao9mQC3aX1nXFOCybwL+EX9YWmy+GLkQ2u56SdAVBWw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 1, 2024 at 3:58 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
[..]
> TBH I don't have a clear idea what to do. It'd be cool to have at least
> some benefits in v17, but I don't know how to do that in a way that
> would be useful in the future.
>
> For example, the v20240124 patch implements this in the executor, but
> based on the recent discussions it seems that's not the right layer -
> the index AM needs to have some control, and I'm not convinced it's
> possible to improve it in that direction (even ignoring the various
> issues we identified in the executor-based approach).
>
> I think it might be more practical to do this from the index AM, even if
> it has various limitations. Ironically, that's what I proposed at pgcon,
> but mostly because it was the quick&dirty way to do this.

... that's a pity! :( Well, then let's just finish that subthread. I
gave some explanations there, and I'll try to take another look in
future revisions.

> > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE static at 8,
> > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead?
> > (allowing it to follow dynamically; the more prefetches the user wants
> > to perform, the more you spread them across shared LRUs and the more
> > memory for history is required?)
> >
> > + * XXX Maybe we could consider effective_cache_size when sizing the cache?
> > + * Not to size the cache for that, ofc, but maybe as a guidance of how many
> > + * heap pages it might keep. Maybe just a fraction of the value,
> > + * say Max(8MB, effective_cache_size / max_connections) or something.
> > + */
> > +#define PREFETCH_LRU_SIZE 8 /* slots in one LRU */
> > +#define PREFETCH_LRU_COUNT 128 /* number of LRUs */
> > +#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
> >
>
> I don't see why this would be related to effective_io_concurrency? It's
> merely about how many recently accessed pages we expect to find in the
> page cache. It's entirely separate from the prefetch distance.

Well, my thought was that the higher eic is, the more I/O parallelism
we are introducing, and so the more requests we need to remember from
the past to avoid prefetching the same blocks again (N * eic, where N
would be some multiplier).
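
To make it concrete, something along these lines (just a sketch; the
multiplier and the clamping bounds are my arbitrary assumptions, not
anything from the patch):

/*
 * Hypothetical: keep PREFETCH_LRU_SIZE static at 8 slots, but derive
 * the number of LRUs from effective_io_concurrency, so that more
 * requested I/O parallelism buys proportionally more prefetch history.
 */
#define PREFETCH_LRU_SIZE 8         /* slots in one LRU */

static int
prefetch_lru_count(int effective_io_concurrency)
{
    int count = 16 * effective_io_concurrency;  /* N = 16, arbitrary */

    /* clamp to sane bounds; 128 matches the current static value */
    if (count < 128)
        count = 128;
    if (count > 1024)
        count = 1024;
    return count;
}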

> > 7. in IndexPrefetchComputeTarget()
> >
> > + * XXX We cap the target to plan_rows, because it's pointless to prefetch
> > + * more than we expect to use.
> >
> > That's a nice fact that's already in the patch, so the XXX isn't needed?
> >
>
> Right, which is why it's not a TODO/FIXME.

OH! That explains it. I had taken all of the XXXs literally, as FIXMEs
that you wanted to go away (things to be removed before the patch is
considered mature).

> But I think it's good to
> point this out - I'm not 100% convinced we should be using plan_rows
> like this (because what happens if the estimate happens to be wrong?).

Well, a somewhat similar problematic pattern was present in a
different codepath - get_actual_variable_endpoint() - see [1],
9c6ad5eaa95. The final fix there was to get away without adding a new
GUC (which is always an option...) and to just introduce a sensible
hard limit (fence), sticking to a limit of 100 visited heap pages.
Here we could have a similar heuristic from the start: if (plan_rows <
we_have_already_visited_pages * avgRowsPerBlock) --> ignore plan_rows
and ramp prefetches back up to the full eic value, as in the sketch
below.
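
A rough sketch of that heuristic (hypothetical names throughout,
nothing here is taken from the patch):

/*
 * Hypothetical: once we've visited more heap pages than plan_rows
 * would imply, treat the estimate as wrong and ramp the prefetch
 * target back up to the full effective_io_concurrency value.
 */
static int
prefetch_target_with_fence(double plan_rows,
                           double pages_already_visited,
                           double avg_rows_per_block,
                           int eic,
                           int capped_target)
{
    if (plan_rows < pages_already_visited * avg_rows_per_block)
        return eic;             /* estimate was too low - ignore it */

    return capped_target;       /* keep the Min(eic, plan_rows)-style cap */
}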

> > Some further tests, given data:
> >
> > CREATE TABLE test (id bigint, val bigint, str text);
> > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL;
> > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int), 3000) FROM generate_series(1, 10000) g;
> > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 + (10*random())::int), 3000) from (select 10000 * random() as r from generate_series(1, 10000)) x;
> > VACUUM ANALYZE test;
> > CREATE INDEX ON test (id);
> >
>
> It's not clear to me what the purpose of this test is. Can you explain?

It's just the schema & data preparation for the tests below:

> >
> > 2. Prefetching for the TOASTed heap seems not to be implemented at
> > all, correct? (Is my assumption right that we should go like this:
> > t_index->t->toast_idx->toast_heap?) I'm too much of a newbie to
> > actually see the code path where it could be added - certainly it's
> > not a blocker -- but maybe the commit message could list such
> > improvements left for the future?
> >
>
> Yes, that's true. I haven't thought about TOAST very much, but with
> prefetching happening in executor, that does not work. There'd need to
> be some extra code for TOAST prefetching. I'm not sure how beneficial
> that would be, considering most TOAST values tend to be stored on
> consecutive heap pages.

Assuming I've generated the data above using the cyclic/random
version, and I run:

SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000;

(btw: I wanted to use octet_length() at first instead of string_agg(),
but that's not enough - it doesn't force reading the TOASTed values)

where fds 45, 54 and 55 correspond to:
lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/45 -> /tmp/blah/base/5/16384 // "test"
lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/54 -> /tmp/blah/base/5/16388 // "pg_toast_16384_index"
lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/55 -> /tmp/blah/base/5/16387 // "pg_toast_16384"

I've got the following:
- 83 pread64 and 83 fadvise() calls at random offsets for fd=45 - the
main intent of this patch (main relation heap prefetching), and it
works well
- 54 pread64 calls for fd=54 (no fadvise() calls)
- 1789 (!) pread64 calls for fd=55 at RANDOM offsets (TOAST heap, no
prefetching)

so at least in theory it makes a lot of sense to prefetch TOAST too;
the access pattern looks like cyclic random:

// pread(fd, "", blocksz, offset)
fadvise64(45, 40960, 8192, POSIX_FADV_WILLNEED) = 0
pread64(55, ""..., 8192, 38002688) = 8192
pread64(55, ""..., 8192, 12034048) = 8192
pread64(55, ""..., 8192, 36560896) = 8192
pread64(55, ""..., 8192, 8871936) = 8192
pread64(55, ""..., 8192, 17965056) = 8192
pread64(55, ""..., 8192, 18710528) = 8192
pread64(55, ""..., 8192, 35635200) = 8192
pread64(55, ""..., 8192, 23379968) = 8192
pread64(55, ""..., 8192, 25141248) = 8192
pread64(55, ""..., 8192, 3457024) = 8192
pread64(55, ""..., 8192, 24633344) = 8192
pread64(55, ""..., 8192, 36462592) = 8192
pread64(55, ""..., 8192, 18120704) = 8192
pread64(55, ""..., 8192, 27066368) = 8192
pread64(45, ""..., 8192, 40960) = 8192
pread64(55, ""..., 8192, 2768896) = 8192
pread64(55, ""..., 8192, 10846208) = 8192
pread64(55, ""..., 8192, 30179328) = 8192
pread64(55, ""..., 8192, 7700480) = 8192
pread64(55, ""..., 8192, 38846464) = 8192
pread64(55, ""..., 8192, 1040384) = 8192
pread64(55, ""..., 8192, 10985472) = 8192

It's probably a separate feature (prefetching blocks from TOAST), but
it could be mentioned that this patch does not do that (I had assumed
it would).
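
Conceptually I was imagining something like this (a sketch only -
toast_prefetch_chunks() is a made-up function and the batching of
chunk TIDs is assumed; none of this exists in the patch):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

/*
 * Hypothetical: before fetching the chunks of a TOASTed value, issue
 * prefetch requests for the heap blocks that the chunk TIDs (collected
 * from the TOAST index) point at, so the random preads above would hit
 * already-initiated I/O.
 */
static void
toast_prefetch_chunks(Relation toastrel, ItemPointer chunk_tids, int nchunks)
{
    BlockNumber last = InvalidBlockNumber;

    for (int i = 0; i < nchunks; i++)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(&chunk_tids[i]);

        /* skip duplicate requests for consecutive chunks on one page */
        if (blkno != last)
            (void) PrefetchBuffer(toastrel, MAIN_FORKNUM, blkno);
        last = blkno;
    }
}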

> > 3. I'm not sure if I got good-enough results for the DESCending index
> > `create index on test (id DESC);` - with eic=16 it doesn't seem to
> > be able to prefetch 16 blocks in advance? (e.g. highlight offset
> > 557056 below in some text editor - its distance between that
> > fadvise<->pread is far lower):
> >
[..]
> >
>
> I'm not sure I understand these strace snippets. Can you elaborate a
> bit, explain what the strace log says?

set enable_seqscan to off;
set enable_bitmapscan to off;
drop index test_id_idx;
create index on test (id DESC); -- DESC one
SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;

OK, so here's a cleaner output of strace -s 0 for the PID running that
SELECT with eic=16, annotated with [*]:

lseek(45, 0, SEEK_END) = 688128
lseek(47, 0, SEEK_END) = 212992
pread64(47, ""..., 8192, 172032) = 8192
pread64(45, ""..., 8192, 90112) = 8192
fadvise64(45, 172032, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 172032) = 8192
fadvise64(45, 319488, 8192, POSIX_FADV_WILLNEED) = 0 [*off 319488 start]
fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 319488) = 8192 [*off 319488,
read, distance=1 fadvises]
fadvise64(45, 466944, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 393216, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 335872) = 8192
fadvise64(45, 540672, 8192, POSIX_FADV_WILLNEED) = 0 [*off 540672 start]
fadvise64(45, 262144, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 466944) = 8192
fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 393216) = 8192
fadvise64(45, 163840, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(45, 385024, 8192, POSIX_FADV_WILLNEED) = 0
pread64(45, ""..., 8192, 540672) = 8192 [*off 540672,
read, distance=4 fadvises]
fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0
[..]
I was wondering why the distance never got >4 in this case with
eic=16 - it should issue more fadvise() calls, shouldn't it? (It was
happening only for DESC; with the normal ASC index the prefetch
distance easily reaches ~eic values.) I think today I've got the
answer: after dropping/creating the DESC index I did NOT run ANALYZE,
so the Min(..., plan_rows) cap was probably kicking in and preventing
full prefetching.
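
If I read the capping right, it's conceptually something like this
(made-up name, just to illustrate why a stale, tiny plan_rows pins the
distance regardless of eic):

static int
compute_prefetch_target(int eic, double plan_rows)
{
    int target = eic;

    /*
     * Cap the target to the planner's row estimate; with a stale
     * estimate (e.g. no ANALYZE after recreating the index) plan_rows
     * stays small and the target never ramps up to eic.
     */
    if ((double) target > plan_rows)
        target = (int) plan_rows;

    return target;
}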

Hitting the above makes me think that the XXX for plan_rows should
really be a FIXME.

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmznOwi0oaV%3D4PHOCM4ygcH4MgSvt8%3D5cu_vNCfc8FSUug%40mail.gmail.com
