Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

From: Bryce Cutt <pandasuit(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Date: 2009-03-21 00:45:14
Message-ID: 1924d1180903201745i7e3f6876w2e108571e3b1843d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Not necessarily true. Seeing as (when the statistics are correct) we
know each of these inner tuples will match with the largest amount of
outer tuples it is just as much of a win per inner tuple as when they
are unique. There is just a chance you will have to give up on the
optimization part way through if too many inner tuples fall into the
new "skew buckets" (formerly IM buckets) and dump the tuples back into
the main buckets. The potential win is still pretty high though.

- Bryce Cutt

On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Bryce Cutt <pandasuit(at)gmail(dot)com> writes:
>>> Here is the new patch.
>>
>> Applied with revisions.  I undid some of the "optimizations" that
>> cluttered the code in order to save a cycle or two per tuple --- as per
>> previous discussion, that's not what the performance questions were
>> about.  Also, I did not like the terminology "in-memory"/"IM"; it seemed
>> confusing since the main hash table is in-memory too.  I revised the
>> code to consistently refer to the additional hash table as a "skew"
>> hashtable and the optimization in general as skew optimization.  Hope
>> that seems reasonable to you --- we could search-and-replace it to
>> something else if you'd prefer.
>>
>> For the moment, I didn't really do anything about teaching the planner
>> to account for this optimization in its cost estimates.  The initial
>> estimate of the number of MCVs that will be specially treated seems to
>> me to be too high (it's only accurate if the inner relation is unique),
>> but getting a more accurate estimate seems pretty hard, and it's not
>> clear it's worth the trouble.  Without that, though, you can't tell
>> what fraction of outer tuples will get the short-circuit treatment.
>
> If the inner relation isn't fairly close to unique you shouldn't be
> using this optimization in the first place.
>
> ...Robert
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2009-03-21 00:51:00 Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Previous Message Robert Haas 2009-03-21 00:35:53 Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets