Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>
Cc: "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-BatchHash Join for Skewed Data Sets
Date: 2009-01-14 19:36:22
Message-ID: 6EEA43D22289484890D119821101B1DF2C1885@exchange20.mercury.ad.ubc.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


> Here is a cleaned-up version. I fixed a number of whitespace issues,
> improved a few comments, and rearranged one set of nested if-else
> statements (hopefully without breaking anything in the process).
>
> Josh / eggyknap -
>
> Can you rerun your performance tests with this version of the patch?

To help with testing, we have constructed a patch specifically for
testing. The patch is the same as Robert's version except that it
tracks and prints out statistics during the join on how many tuples are
affected and has the enable_hashjoin_usestatmcvs variable defined so
that it is easy to turn on/off skew handling. This is useful as
although the patch reduces the number of I/Os performed, this
improvement may not be seen in some queries which are dominated by other
cost factors (non-skew joins, CPU time, time to scan input relations,
etc.).

The sample output looks like this:

LI-P
Values: 100 Skew: 0.27 Est. tuples: 59986052.00 Batches: 512 Est.
Save: 16114709.99
Total Inner Tuples: 2000000
IM Inner Tuples: 83
Batch Zero Inner Tuples: 3941
Batch Zero Potential Inner Tuples: 3941
Total Outer Tuples: 59986052
IM Outer Tuples: 16074146
Batch Zero Outer Tuples: 98778
Batch Zero Potential Outer Tuples: 98778
Total Output Tuples: 59986052
IM Output Tuples: 16074146
Batch Zero Output Tuples: 98778
Batch Zero Potential Output Tuples: 98778
Percentage less tuple IOs than HHJ: 25.98

The other change is that the system calculates the skew and will not use
the in-memory skew partition if the skew is less than 1%.

Finally, we have attached some performance results for the TPCH 10G data
set (skew factors z=1 and z=2). For the Customer-Orders-Lineitem-Part
query that Josh was testing, we see no overall time difference that is
significant compared to experimental error (although there is I/O
benefit for the Lineitem-Part join). This query cost is dominated by
the non-skew joins of Customer-Orders and Orders-Lineitem and output
tuple construction.

The joins with skew, Lineitem-Supplier and Lineitem-Part, show
significantly improved performance. Note how the statistics show that
the percentage I/O savings is directly proportional to the skew.
However, the overall query time savings is always less than this as
there are other costs such as reading the relations, performing the hash
comparisons, building the output tuples, etc. that are unaffected by the
optimization.

At this point, we await further feedback on what is necessary to get
this patch accepted. We would also like to thank Josh and Robert again
for their review time.

Sincerely,

Ramon Lawrence and Bryce Cutt

Attachment Content-Type Size
histojoin_testing.patch application/octet-stream 34.5 KB
tpch10g1zResults.txt text/plain 2.1 KB
tpch10g2zResults.txt text/plain 2.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2009-01-14 21:59:19 pgsql: Make 'find' syntax consistent; add .git exclusion to make_ctags.
Previous Message Martijn van Oosterhout 2009-01-14 19:14:15 Re: Updates of SE-PostgreSQL 8.4devel patches (r1403)