Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Bryce Cutt" <pandasuit(at)gmail(dot)com>
To: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-12-23 07:21:22
Message-ID: 1924d1180812222321t6ea99444xc01feaee47b26326@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Because there is no nice way in PostgreSQL (that I know of) to derive
a histogram after a join (on an intermediate result) currently
usingMostCommonValues is only enabled on a join when the outer (probe)
side is a table scan (seq scan only actually). See
getMostCommonValues (soon to be called
ExecHashJoinGetMostCommonValues) for the logic that determines this.

Here is the result of explain (on a 100MB version of PostgreSQL):
"Hash Left Join (cost=16232.00..91035.00 rows=600000 width=526)"
" Hash Cond: (l.l_partkey = p.p_partkey)"
" -> Hash Left Join (cost=15368.00..75171.00 rows=600000 width=395)"
" Hash Cond: (l.l_orderkey = o.o_orderkey)"
" -> Seq Scan on lineitem l (cost=0.00..17867.00 rows=600000
width=125)"
" -> Hash (cost=8073.00..8073.00 rows=150000 width=270)"
" -> Hash Left Join (cost=700.50..8073.00 rows=150000 width=270)"
" Hash Cond: (o.o_custkey = c.c_custkey)"
" -> Seq Scan on orders o (cost=0.00..4185.00
rows=150000 width=109)"
" -> Hash (cost=513.00..513.00 rows=15000 width=161)"
" -> Seq Scan on customer c
(cost=0.00..513.00 rows=15000 width=161)"
" -> Hash (cost=614.00..614.00 rows=20000 width=131)"
" -> Seq Scan on part p (cost=0.00..614.00 rows=20000 width=131)"

If you take a look at the explain for that join you will see that the
first of the relations joined are orders and customer on custkey.
There is almost no skew in the o_custkey attribute of orders even in
the Z2 dataset so the difference between hashjoin with and without
usingMostCommonValues enabled is quite small.

The second join performed is to join the result of the first join with
lineitem on orderkey. There is no skew at all in the l_orderkey
attribute of lineitem so usingMostCommonValues can not help at all.

The third join performed is to join the result of the second join with
part on partkey. There is a lot of skew in the l_partkey attribute of
lineitem but because the probe side of the third join is an
intermediate from the second join and not a seq scan the algorithm
cannot figure out the MCVs of the probe side.

So on the query presented almost no skew can be exploited on the first
join and no other joins can have their skew exploited at all because
of the order PostgreSQL does the joins in. Basically yes, you would
not see any real benefit from using the most common values on this
query.

We experimented with sampling (mentioned in the paper) to make an
educated guess of MCVs on intermediary results but found that because
a random sample could not be obtained the results were always very
inaccurate. I basically just read a percentage of tuples from the
probe relation before partitioning the build relation, derived the
MCVs in a single pass, wrote the tuples back out to a temp file
(because reading back from here is less expensive than resetting the
probe side tree), then did the join as usual while remembering to read
back from my temp file before reading the rest of the probe side
tuples. Our tests indicate that sampling is not likely a good
solution for deriving MCVs from intermediary results.

In the Java implementation of histojoin we experimented with
exploiting skew in multiple joins of a star join with some success
(detailed in paper). I am not sure how this would be accomplished
nicely in PostgreSQL.

If the cost operators knew how to order the joins to make the best use
of skew in the relations PostgreSQL could use the benefits of
histojoin more often if perhaps doing a join with skew first would
have speed benefits over doing the smaller join first. This change
could be a future addition to PostgreSQL if this patch is accepted.
It relies on getting the stats tuple for the join during the planning
phase (in the cost function) and estimating the benefit that would
have on the join cost.

- Bryce Cutt

On Mon, Dec 22, 2008 at 6:15 AM, Joshua Tolley <eggyknap(at)gmail(dot)com> wrote:
> On Sun, Dec 21, 2008 at 10:25:59PM -0500, Robert Haas wrote:
>> [Some performance testing.]
>
> I (finally!) have a chance to post my performance testing results... my
> apologies for the really long delay. <Excuses omitted>
>
> Unfortunately I'm not seeing wonderful speedups with the particular
> queries I did in this case. I generated three 1GB datasets, with skews
> set at 1, 2, and 3. The test script I wrote turns on enable_usestatmcvs
> and runs EXPLAIN ANALYZE on the same query five times. Then it turns
> enable_usestatmcvs off, and runs the same query five more times. It does
> this with each of the three datasets in turn, and then starts over at
> the beginning until I tell it to quit. My results showed a statistically
> significant improvement in speed only on the skew == 3 dataset.
>
> I did the same tests twice, once with default_statistics_target set to
> 10, and once with it set to 100. I've attached boxplots of the total
> query times as reported by EXPLAIN ANALYZE ("dst10" in the filename
> indicates default_statistics_target was 10, and so on), my results
> parsed out of the EXPLAIN ANALYZE output (test.filtered.10 and
> test.filtered.100), the results of one-tailed Student's T tests of the
> result set (ttests), and the R code to run the tests if anyone's really
> interested (t.test.R).
>
> The results data includes six columns: the skew value, whether
> enable_usestatmcvs was on or not (represented by a 1 or 0), total times
> for each of the three joins that made up the query, and total time for
> the query itself. The results above pay attention only to the total
> query time.
>
> Finally, the query involved:
>
> SELECT * FROM lineitem l LEFT JOIN part p ON (p.p_partkey = l.l_partkey)
> LEFT JOIN orders o ON (o.o_orderkey = l.l_orderky) LEFT JOIN customer c
> ON (c.c_custkey = o.o_custkey);
>
> - Josh / eggyknap
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (GNU/Linux)
>
> iEYEARECAAYFAklPoRYACgkQRiRfCGf1UMNUJgCcCxCRNXJS65nXqMsY2h6PENKF
> YkQAoJlSlaaHd2L5dkFUAc8GPKfKezS5
> =KWfi
> -----END PGP SIGNATURE-----
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2008-12-23 08:21:15 Re: Hot standby and b-tree killed items
Previous Message Emmanuel Cecchet 2008-12-23 06:09:45 Re: Sync Rep: Second thoughts