Experimenting with hash join prefetch

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Experimenting with hash join prefetch
Date: 2018-10-14 04:18:19
Message-ID: CAEepm=2y9HM9QP+HhRZdQ3pU6FShSMyu=V1uHXhQ5gG-dketHg@mail.gmail.com
Lists: pgsql-hackers

Hello hackers,

I have a list of micro-optimisations and things to look into for hash
joins, which I've updated on the wiki[1]. Here's one that I was
inspired to poke at with a stick in a spare hour today.

Cache-oblivious hash joins cause a lot of TLB and cache misses.
Researchers tell us to reduce those using huge/super VM pages[2] and
cache prefetch instructions[3]. (There is another class of
cache-aware hash join algorithms that partition carefully up front to
avoid them; that's not us.)

Here is a totally contrived experiment that shows the effect quite
reproducibly here:

shared_buffers = '1GB'

create table t as select generate_series(1, 20000000)::int i;
set max_parallel_workers_per_gather = 0;
set work_mem = '2GB';
select pg_prewarm('t'::regclass);

select count(*) from t t1 join t t2 using (i);

-> 00:12.683

First, let's try to prefetch the hash bucket for the next tuple while
computing the hash for the current tuple, since we can see into the
future quite easily: we know the keys are sequential integers in this
contrived experiment. In ExecHashGetHashValue():

+ /* Prefetch the bucket for the next key */
+ uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
+ uint32 next_bucket = next_hash % hashtable->nbuckets;
+ __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);

select count(*) from t t1 join t t2 using (i);

-> 00:09.901

Huzzah! Next up, let's try a two-stage prefetch pipeline for the
probing phase, seeing two steps ahead:

+ /* Prefetch the bucket for the tuple after next */
+ uint32 next_next_hash = hash_uint32(DatumGetInt32(keyval) + 2);
+ uint32 next_next_bucket = next_next_hash % hashtable->nbuckets;
+ __builtin_prefetch(&hashtable->buckets.unshared[next_next_bucket]);
+ if (outer_tuple)
+ {
+     /* Prefetch the first tuple in the next bucket */
+     uint32 next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
+     uint32 next_bucket = next_hash % hashtable->nbuckets;
+     __builtin_prefetch(hashtable->buckets.unshared[next_bucket]);
+ }

-> 00:09.138

It's nice to see this effect, but it isn't really surprising: there is
no doubt about the value of prefetching randomly accessed data.

I think we could probably do this for the build phase with the
existing tuple-at-a-time executor interface by doing the bucket
insertions from a queue that runs N tuples behind the one we're
currently loading and hashing. Or something like that. For the probe
phase (probably much more interesting), I think it'd involve extra
tuple copying, so that we could still access the current tuple while
pulling in the next one to hash its key. To avoid that copying we'd
need a facility for peeking at future tuples, or a proper first-class
batch mode.

[1] https://wiki.postgresql.org/wiki/Parallel_Hash
[2] https://15721.courses.cs.cmu.edu/spring2016/papers/balkesen-icde2013.pdf
(table VI)
[3] http://www.cs.cmu.edu/~chensm/papers/hashjoin_icde04.pdf

--
Thomas Munro
http://www.enterprisedb.com
