From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Experimenting with hash join prefetch
Date: 2018-10-14 04:18:19
Message-ID: CAEepm=2y9HM9QP+HhRZdQ3pU6FShSMyu=V1uHXhQ5gG-dketHg@mail.gmail.com
Lists: pgsql-hackers
Hello hackers,
I have a list of micro-optimisations and things to look into for hash
joins, which I've updated on the wiki[1]. Here's one that I was
inspired to poke at with a stick in a spare hour today.
Cache-oblivious hash joins cause a lot of TLB and cache misses.
Researchers tell us to reduce those using huge/super VM pages[2] and
cache prefetch instructions[3]. (There is another class of
cache-aware hash join algorithms that partition carefully up front to
avoid them; that's not us.)
Here is a totally contrived experiment that shows the effect quite
reproducibly here:
shared_buffers = '1GB'
create table t as select generate_series(1, 20000000)::int i;
set max_parallel_workers_per_gather = 0;
set work_mem = '2GB';
select pg_prewarm('t'::regclass);
select count(*) from t t1 join t t2 using (i);
-> 00:12.683
First, let's try to prefetch the hash bucket for the next tuple while
computing the hash for the current tuple, since we can see into the
future quite easily: we know the keys are sequential integers in this
contrived experiment. In ExecHashGetHashValue():
+ /* Prefetch the bucket for the next key */
+ uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
+ uint32 next_bucket = next_hash % hashtable->nbuckets;
+ __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);
select count(*) from t t1 join t t2 using (i);
-> 00:09.901
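To make the idea concrete outside PostgreSQL, here is a toy sketch of the same trick on a standalone chained hash table: while probing for key k, issue a prefetch for the bucket that key k+1 will hash to, which only works because the probe stream is predictable (sequential integers, as in the contrived experiment above). All names here are hypothetical and the hash function is an arbitrary integer mixer, not PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Tuple
{
	uint32_t	key;
	struct Tuple *next;			/* bucket chain */
} Tuple;

typedef struct
{
	Tuple	  **buckets;
	uint32_t	nbuckets;		/* power of two */
} HashTable;

static uint32_t
mix32(uint32_t k)
{
	/* cheap integer mixer; any reasonable hash would do */
	k = (k ^ 61) ^ (k >> 16);
	k = k + (k << 3);
	k = k ^ (k >> 4);
	k = k * 0x27d4eb2d;
	return k ^ (k >> 15);
}

static void
ht_insert(HashTable *ht, Tuple *t)
{
	uint32_t	b = mix32(t->key) & (ht->nbuckets - 1);

	t->next = ht->buckets[b];
	ht->buckets[b] = t;
}

static Tuple *
probe_with_prefetch(HashTable *ht, uint32_t key)
{
	/* Prefetch the bucket header the *next* sequential key will need. */
	uint32_t	next_bucket = mix32(key + 1) & (ht->nbuckets - 1);

	__builtin_prefetch(&ht->buckets[next_bucket]);

	/* Ordinary chained-bucket probe for the current key. */
	for (Tuple *t = ht->buckets[mix32(key) & (ht->nbuckets - 1)];
		 t != NULL; t = t->next)
		if (t->key == key)
			return t;
	return NULL;
}
```

The prefetch overlaps the memory latency of the next bucket header with the hashing and chain walk for the current key; correctness is unaffected since __builtin_prefetch is only a hint.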
Huzzah! Next up, let's try a two-stage prefetch pipeline for the
probing phase, seeing two steps ahead:
+ /* Prefetch the bucket for the tuple after next */
+ uint32 next_next_hash = hash_uint32(DatumGetInt32(keyval) + 2);
+ uint32 next_next_bucket = next_next_hash % hashtable->nbuckets;
+ __builtin_prefetch(&hashtable->buckets.unshared[next_next_bucket]);
+ if (outer_tuple)
+ {
+     /* Prefetch the first tuple in the next bucket */
+     uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
+     uint32 next_bucket = next_hash % hashtable->nbuckets;
+     __builtin_prefetch(hashtable->buckets.unshared[next_bucket]);
+ }
-> 00:09.138
It's nice to see this effect, but it isn't really surprising: there is
no doubt about the value of prefetching random access data.
I think we could probably do this for the build phase with the
existing tuple-at-a-time executor interface by doing the bucket
insertions from a queue that runs N tuples behind the one we're
currently loading and hashing. Or something like that. For the probe
phase (probably much more interesting) I think it'd involve extra
tuple copying, so that it could still access the last tuple while
pulling the next tuple to hash its key. To avoid that we'd need a
facility for peeking at future tuples, or a proper first class batch
mode.
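The N-tuples-behind insertion queue for the build phase could look something like this toy sketch (hypothetical names and structures, not the executor interface): keep a small ring of pending (bucket, tuple) pairs, prefetch a tuple's bucket when it arrives and enqueue it, and perform the actual chain insertion for the tuple enqueued PREFETCH_DIST steps earlier, by which time its cache line has hopefully arrived.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PREFETCH_DIST 8			/* ring size; a tuning knob */

typedef struct BuildTuple
{
	uint32_t	key;
	struct BuildTuple *next;
} BuildTuple;

typedef struct
{
	BuildTuple **buckets;
	uint32_t	nbuckets;		/* power of two */

	/* insertions deferred until PREFETCH_DIST tuples later */
	BuildTuple *pending[PREFETCH_DIST];
	uint32_t	pending_bucket[PREFETCH_DIST];
	int			head;			/* next ring slot to reuse */
	int			nfilled;		/* slots in use (ramp-up) */
} BuildState;

static uint32_t
bmix32(uint32_t k)
{
	k = (k ^ 61) ^ (k >> 16);
	k = k + (k << 3);
	k = k ^ (k >> 4);
	k = k * 0x27d4eb2d;
	return k ^ (k >> 15);
}

static void
flush_one(BuildState *bs, int slot)
{
	/* Do the real chain insertion for a previously queued tuple. */
	BuildTuple *t = bs->pending[slot];
	uint32_t	b = bs->pending_bucket[slot];

	t->next = bs->buckets[b];
	bs->buckets[b] = t;
}

static void
build_insert(BuildState *bs, BuildTuple *t)
{
	uint32_t	bucket = bmix32(t->key) & (bs->nbuckets - 1);

	/* The slot we're about to reuse holds the oldest tuple: insert it now. */
	if (bs->nfilled == PREFETCH_DIST)
		flush_one(bs, bs->head);
	else
		bs->nfilled++;

	/* Prefetch this tuple's bucket and park it in the ring. */
	__builtin_prefetch(&bs->buckets[bucket]);
	bs->pending[bs->head] = t;
	bs->pending_bucket[bs->head] = bucket;
	bs->head = (bs->head + 1) % PREFETCH_DIST;
}

static void
build_finish(BuildState *bs)
{
	/* Drain whatever is still queued, oldest first. */
	while (bs->nfilled > 0)
	{
		flush_one(bs, (bs->head - bs->nfilled + PREFETCH_DIST) % PREFETCH_DIST);
		bs->nfilled--;
	}
}

static BuildTuple *
build_lookup(BuildState *bs, uint32_t key)
{
	for (BuildTuple *t = bs->buckets[bmix32(key) & (bs->nbuckets - 1)];
		 t != NULL; t = t->next)
		if (t->key == key)
			return t;
	return NULL;
}
```

The fixed-size ring keeps the deferred state tiny (it stays in L1), and the only cost is a drain step at the end of the build; whether the bookkeeping pays for itself against the saved cache misses is exactly the sort of thing that would need measuring.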
[1] https://wiki.postgresql.org/wiki/Parallel_Hash
[2] https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf
(table VI)
[3] http://www.cs.cmu.edu/~chensm/papers/hashjoin_icde04.pdf
--
Thomas Munro
http://www.enterprisedb.com