Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-11-24 16:16:57
Message-ID: 6EEA43D22289484890D119821101B1DF2C1751@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> I'm a tad worried about what happens when the values that are frequently
> occurring in the outer relation are also frequently occurring in the
> inner (which hardly seems an improbable case). Don't you stand a severe
> risk of blowing out the in-memory hash table? It doesn't appear to me
> that the code has any way to back off once it's decided that a certain
> set of join key values are to be treated in-memory. Splitting the main
> join into more batches certainly doesn't help with that.
>
> Also, AFAICS the benefit of this patch comes entirely from avoiding dump
> and reload of tuples bearing the most common values, which means it's a
> significant waste of cycles when there's only one batch. It'd be better
> to avoid doing any of the extra work in the single-batch case.
>
> One thought that might address that point as well as the difficulty of
> getting stats in nontrivial cases is to wait until we've overrun memory
> and are forced to start batching, and at that point determine on-the-fly
> which are the most common hash values from inspection of the hash table
> as we dump it out. This would amount to optimizing on the basis of
> frequency in the *inner* relation not the outer, but offhand I don't see
> any strong theoretical basis why that wouldn't be just as good. It
> could lose if the first work_mem worth of inner tuples isn't
> representative of what follows; but this hardly seems more dangerous
> than depending on MCV stats that are for the whole outer relation rather
> than the portion of it being selected.
>
> regards, tom lane

You are correct with both observations. The patch only has a benefit
when there is more than one batch. Also, there is a potential issue
with MCV hash table overflows if the number of tuples that match the
MCVs in the build relation is very large.

Bryce has created a patch (attached) that disables the code for one
batch joins. This patch also checks for MCV hash table overflows and
handles them by "flushing" from the MCV hash table back to the main hash
table. The main hash table will then resolve overflows as usual. Note
that this will cause the worst case of a build table with all the same
values to be handled the same as the current hash code, i.e., it will
attempt to re-partition until it eventually gives up and then allocates
the entire partition in memory. There may be a better way to handle
this case, but the new patch will remain consistent with the current
hash join implementation.

The issue with determining and using the MCV stats is more challenging
than it appears. First, knowing the MCVs of the build table will not
help us. What we need are the MCVs of the probe table because by
knowing those values we will keep the tuples with those values in the
build relation in memory. For example, consider a join between tables
Part and LineItem. Assume 1 popular part accounts for 10% of all
LineItems. If Part is the build relation and LineItem is the probe
relation, then by keeping that 1 part record in memory, we will
guarantee that we do not need to write out 10% of LineItem. If a
selection occurs on LineItem before the join, it may change the
distribution of LineItem (the MCVs) but it is probable that they are
still a good estimate of the MCVs in the derived LineItem relation. (We
did experiments on trying to sample the first few thousand tuples of the
probe relation to dynamically determine the MCVs but generally found
this was inaccurate due to non-random samples.) In essence, the goal is
to smartly pick the tuples that remain in the in-memory batch before
probing begins. Since the number of MCVs is small, incorrectly
selecting build tuples to remain in memory has negligible cost.

If we assume that LineItem has been filtered so much that it is now
smaller than Part and is the build relation then the MCV approach does
not apply. There is no skew in Part on partkey (since it is the PK) and
knowing the MCV partkeys in LineItem does not help us because they each
only join with a single tuple in Part. In this case, the MCV approach
should not be used because no benefit is possible, and it will not be
used because there will be no MCVs for Part.partkey.

The bad case with MCV hash table overflow requires a many-to-many join
between the two relations, which does not occur with the more typical
PK-FK joins.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon(dot)lawrence(at)ubc(dot)ca

Attachment Content-Type Size
histojoin_v3.patch application/octet-stream 21.0 KB
