Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

From: Bryce Cutt <pandasuit(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Date: 2009-03-21 02:22:20
Message-ID: 1924d1180903201922i37ee33f3idb060f631d58a322@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

The estimation functions assume the inner relation join column is
unique. But it freezes (flushes back to the main hash table) one skew
bucket at a time in order of least importance so if 100 inner tuples
can fit in the skew buckets then the skew buckets are only fully blown
out if the best tuple (the single most common value) occurs more than
100 times in the inner relation. And up until that point you still
have the tuples in memory that are the best "per tuple worth of
memory". But yes, after that point (more than 100 tuples of that best
MCV) the entire effort was wasted. The skew buckets are dynamically
flushed just like buckets in a dynamic hash join would be.

- Bryce Cutt

On Fri, Mar 20, 2009 at 5:51 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt <pandasuit(at)gmail(dot)com> wrote:
>> On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> If the inner relation isn't fairly close to unique you shouldn't be
>>> using this optimization in the first place.
>> Not necessarily true.  Seeing as (when the statistics are correct) we
>> know each of these inner tuples will match with the largest amount of
>> outer tuples it is just as much of a win per inner tuple as when they
>> are unique.  There is just a chance you will have to give up on the
>> optimization part way through if too many inner tuples fall into the
>> new "skew buckets" (formerly IM buckets) and dump the tuples back into
>> the main buckets.  The potential win is still pretty high though.
>>
>> - Bryce Cutt
>
> Maybe I'm remembering wrong, but I thought the estimating functions
> assuemd that the inner relation was unique.  So if there turn out to
> be 2, 3, 4, or more copies of each value, the chances of blowing out
> the skew hash table are almost 100%, I would think...  am I wrong?
>
> ...Robert
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-03-21 02:42:29 Re: contrib function naming, and upgrade issues
Previous Message Andrew Gierth 2009-03-21 01:57:05 contrib function naming, and upgrade issues