Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 16:52:37
Message-ID: 6EEA43D22289484890D119821101B1DF2C199B@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> From: Tom Lane
> Heikki's got a point here: the planner is aware that hashjoin doesn't
> like skewed distributions, and it assigns extra cost accordingly if it
> can determine that the join key is skewed. (See the "bucketsize" stuff
> in cost_hashjoin.) If this patch is accepted we'll want to tweak that
> code.

Those modifications would make the optimizer more likely to select hash
join, even with skewed distributions. For the TPC-H data set that we
are using, the optimizer always picks hash join over merge join (single
or multi-batch). Since the current patch does not change the cost
function, there is no change in the planning cost. It may or may not be
useful to modify the cost function depending on the effect on planning
cost.

> Still, that has little to do with the current gating issue, which is
> whether we've convinced ourselves that the patch doesn't cause a
> performance decrease for cases in which it's unable to help.

Although we have not seen any overhead when the optimization is
bypassed, we are looking at some small code changes that would
guarantee that no extra statements are executed in the single-batch
case. Currently, an "if optimization_on" check is performed on each
probe tuple; the cost is minor, but it should be avoidable.
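
As a rough illustration of the kind of change meant here, the sketch
below makes the decision once per join rather than once per probe
tuple. It is not code from the patch: ProbeState, is_skew_key,
probe_plain, probe_with_skew, choose_probe and run_probe are all
hypothetical names, and the "skew key" rule is a toy stand-in. It
assumes the flag and the batch count are known before probing starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-join state; not a PostgreSQL struct. */
typedef struct ProbeState
{
    bool skew_opt_on;   /* plays the role of "optimization_on" */
    long skew_hits;     /* probe tuples handled by the skew path */
    long main_hits;     /* probe tuples handled by the regular path */
} ProbeState;

typedef void (*ProbeFn) (ProbeState *st, int key);

/* Toy rule standing in for "this key is one of the frequent values". */
static bool
is_skew_key(int key)
{
    return key % 4 == 0;
}

/* Regular probe path: contains no optimization check at all. */
static void
probe_plain(ProbeState *st, int key)
{
    (void) key;
    st->main_hits++;
}

/* Probe path used only when the optimization is active. */
static void
probe_with_skew(ProbeState *st, int key)
{
    if (is_skew_key(key))
        st->skew_hits++;
    else
        st->main_hits++;
}

/* Decide once, before the probe loop, which path every tuple takes. */
static ProbeFn
choose_probe(const ProbeState *st, int nbatch)
{
    return (st->skew_opt_on && nbatch > 1) ? probe_with_skew : probe_plain;
}

/* Probe loop: the per-tuple flag test is gone from the loop body. */
static void
run_probe(ProbeState *st, const int *keys, size_t nkeys, int nbatch)
{
    ProbeFn probe;
    size_t  i;

    assert(st != NULL && keys != NULL);
    probe = choose_probe(st, nbatch);
    for (i = 0; i < nkeys; i++)
        probe(st, keys[i]);
}
```

With nbatch == 1, run_probe never reaches the skew path regardless of
the flag. An indirect call has its own cost, so whether this beats a
well-predicted branch would have to be measured; keeping two separate
copies of the loop would be another way to get the same effect.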

The patch's author, Bryce Cutt, is defending his Master's thesis Friday
morning (on this work), so we will provide some updated code right after
that. Since these code changes are small, they should not affect people
trying to test the performance of the current patch.

--
Ramon Lawrence
