Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: Joshua Tolley <eggyknap(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bryce Cutt <pandasuit(at)gmail(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-12-22 14:15:58
Message-ID: 20081222141550.GA5720@uber
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Dec 21, 2008 at 10:25:59PM -0500, Robert Haas wrote:
> [Some performance testing.]

I (finally!) have a chance to post my performance testing results... my
apologies for the really long delay. <Excuses omitted>

Unfortunately I'm not seeing wonderful speedups with the particular
queries I did in this case. I generated three 1GB datasets, with skews
set at 1, 2, and 3. The test script I wrote turns on enable_usestatmcvs
and runs EXPLAIN ANALYZE on the same query five times. Then it turns
enable_usestatmcvs off, and runs the same query five more times. It does
this with each of the three datasets in turn, and then starts over at
the beginning until I tell it to quit. My results showed a statistically
significant improvement in speed only on the skew == 3 dataset.

I did the same tests twice, once with default_statistics_target set to
10, and once with it set to 100. I've attached boxplots of the total
query times as reported by EXPLAIN ANALYZE ("dst10" in the filename
indicates default_statistics_target was 10, and so on), my results
parsed out of the EXPLAIN ANALYZE output (test.filtered.10 and
test.filtered.100), the results of one-tailed Student's T tests of the
result set (ttests), and the R code to run the tests if anyone's really
interested (t.test.R).

The results data includes six columns: the skew value, whether
enable_usestatmcvs was on or not (represented by a 1 or 0), total times
for each of the three joins that made up the query, and total time for
the query itself. The results above pay attention only to the total
query time.

Finally, the query involved:

SELECT * FROM lineitem l LEFT JOIN part p ON (p.p_partkey = l.l_partkey)
LEFT JOIN orders o ON (o.o_orderkey = l.l_orderky) LEFT JOIN customer c
ON (c.c_custkey = o.o_custkey);

- Josh / eggyknap

Attachment Content-Type Size
image/png 4.9 KB
image/png 5.0 KB
test.filtered.10 text/plain 4.9 KB
test.filtered.100 text/plain 3.3 KB
ttests text/plain 2.9 KB
t.test.R text/plain 1.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jaime Casanova 2008-12-22 15:44:47 Re: reloptions and toast tables
Previous Message Jan Urbański 2008-12-22 13:55:00 Re: Lock conflict behavior?