Re: Experimenting with hash join prefetch

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimenting with hash join prefetch
Date: 2018-10-14 12:33:00
Message-ID: CAEepm=23BB8kOzLASfpXzPShP8NLzC8t58uWeqCxL+_W7yutqw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 15, 2018 at 12:16 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> > On Sun, 14 Oct 2018 at 06:19, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > Cache-oblivious hash joins cause a lot of TLB and cache misses.
> > ...
> > (There is another class of cache-aware hash join algorithms that partition
> > carefully up front to avoid them; that's not us.)
>
> Just out of curiosity, can you please elaborate more on this part (with
> references)? I've been thinking about this topic for a while, and I'm
> wondering: if by another class you mean something like this [1], then even if
> it's not us today, are there any issues that prevent experimenting in this area?

Hmm, I didn't mean the term-of-art "cache-oblivious" (Wikipedia
definition: "an algorithm designed to take advantage of a CPU cache
without having the size of the cache as an explicit parameter"); I
simply meant not "cache-conscious": we don't do anything at all to
reduce cache misses, though obviously we could add prefetching to
improve on that.
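
To make that concrete, here's a toy sketch of what prefetching during
the build phase could look like (this is not the experimental patch
from this thread; the BucketEntry/HashTable types, the fixed prefetch
distance and the precomputed hashes array are all made up for
illustration):

#include <stddef.h>
#include <stdint.h>

/* Toy chaining hash table: an array of bucket list heads. */
typedef struct BucketEntry
{
    uint32_t key;
    struct BucketEntry *next;
} BucketEntry;

typedef struct
{
    BucketEntry **buckets;   /* power-of-two number of buckets */
    uint32_t mask;           /* nbuckets - 1 */
} HashTable;

#define PREFETCH_DISTANCE 4

/*
 * Insert entries[0..n-1], whose hash values are in hashes[0..n-1],
 * prefetching the bucket slot we'll need a few insertions from now so
 * that its cache miss overlaps with the work on earlier entries.
 */
static void
build_with_prefetch(HashTable *ht, BucketEntry *entries,
                    const uint32_t *hashes, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(
                &ht->buckets[hashes[i + PREFETCH_DISTANCE] & ht->mask]);

        BucketEntry **bucket = &ht->buckets[hashes[i] & ht->mask];
        entries[i].next = *bucket;   /* bucket slot is (hopefully) cached */
        *bucket = &entries[i];
    }
}

The same trick should work for probing: hash a few outer tuples ahead
and prefetch their buckets before doing the key comparisons.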

The distinction I'm making is between "no partition" hash join (what
we have), where you don't have to do any work up front, but you pay
for a lot of cache misses during building and probing, and "partition"
hash join, notably "radix join" (what MonetDB has?), where you have a
partition phase before the build phase that aims to break the data up
into enough partitions so that the hash tables will fit better in
cache, making the later phases faster. There seems to be some
disagreement about which is best -- passing through the data first is
expensive, but so are cache misses on every probe, and there are
claims that the winner depends on skew and tuple size.
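
For anyone following along, a minimal version of the radix
partitioning step might look something like this (purely illustrative:
the Row struct and the fixed 2^6 fan-out are assumptions, and a real
radix join would tune the fan-out to TLB/cache geometry and possibly
use multiple passes):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RADIX_BITS 6
#define NPARTS (1 << RADIX_BITS)

typedef struct
{
    uint32_t hash;
    uint32_t payload;
} Row;

/*
 * One radix pass: scatter rows into 2^RADIX_BITS partitions using the
 * low bits of their hash, so that each partition's hash table can be
 * built and probed while staying (mostly) in cache.
 */
static void
radix_partition(const Row *in, size_t n, Row *out,
                size_t part_start[NPARTS + 1])
{
    size_t counts[NPARTS] = {0};
    size_t cursor[NPARTS];

    /* first pass: histogram of partition sizes */
    for (size_t i = 0; i < n; i++)
        counts[in[i].hash & (NPARTS - 1)]++;

    /* prefix sum gives each partition's start offset in 'out' */
    part_start[0] = 0;
    for (int p = 0; p < NPARTS; p++)
        part_start[p + 1] = part_start[p] + counts[p];

    /* second pass: scatter rows into their partitions */
    memcpy(cursor, part_start, sizeof(cursor));
    for (size_t i = 0; i < n; i++)
        out[cursor[in[i].hash & (NPARTS - 1)]++] = in[i];
}

After that pass, each partition gets its own small hash table for
build and probe, which is where the claimed cache-miss savings come
from.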

Here's some of the stuff I read/watched about this subject:

https://15721.courses.cs.cmu.edu/spring2018/schedule.html#apr-04-2018
http://www.adms-conf.org/2017/camera-ready/Analyzing_In_Memory_Hash_Joins__Granularity_Matters.pdf

Skimming very quickly through the paper you posted, yeah, I mean
exactly that stuff. Specifically I was thinking of the radix join
mentioned in background section 2.3. (I see also that the same
authors wrote a paper "Cache-Oblivious Hash Joins" which appears to
describe a refinement of radix that doesn't need to be parameterised
for cache size.) Sure, we could always consider these ideas. I am
not personally working on that; to me it looked very hard to build
for a payoff that is far from certain!

(Note: we do of course have a kind of partitioning, called "batching",
that kicks in when work_mem runs out, but it doesn't do anything to
reduce cache misses, so if I understood correctly it still counts as
"no partition" as far as this discussion goes.)
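
For the archives, here's roughly how that bucket/batch split works,
paraphrased from memory rather than copied from nodeHash.c, with the
names simplified:

#include <stdint.h>

/*
 * With nbuckets and nbatch both powers of two, the low bits of the
 * hash pick the in-memory bucket and the next bits pick the batch.
 * nbatch is sized so that each batch fits in work_mem, not in any
 * CPU cache.
 */
static inline void
get_bucket_and_batch(uint32_t hashvalue,
                     uint32_t nbuckets, uint32_t log2_nbuckets,
                     uint32_t nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (nbatch > 1)
        ? (hashvalue >> log2_nbuckets) & (nbatch - 1)
        : 0;
}

The point being that nbatch is driven by work_mem, so batches are
generally far too large to fit in any CPU cache.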

--
Thomas Munro
http://www.enterprisedb.com
