The testing of multi-batch hash joins with skewed data sets patch

From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: <pandasuit(at)gmail(dot)com>, <ramon(dot)lawrence(at)ubc(dot)ca>
Subject: The testing of multi-batch hash joins with skewed data sets patch
Date: 2009-02-10 22:05:17
Message-ID: 0E046949417A446896F804BDC0EF4245@amd64
Lists: pgsql-hackers

I've been putting some thought into how to go about testing the performance
of this patch. From reading the previous threads, quite a bit of testing was
done with a certain data set, and everyone who tested it found it to be a big
winner, with staggering performance gains on the skewed data set. Still, the
wiki page states that it needs performance testing. I'm guessing what we
really need to ask now is: are non-skewed data sets any slower with the
patch, and at what point do we start seeing the gains?

So I've been working a little on a set of data that can be created simply by
running a few SQL statements. I've yet to run the tests, as I'm having some
hardware problems with my laptop. In the meantime I thought I'd share what I
was going to test with the community, to see if I'm going about things the
right way.

The idea I came up with for benchmarking is a little similar to what I
remember from the original tests. I have a sales orders table and a products
table. My version of the sales orders table contains a customer column. Data
for 10 customers is populated into the sales orders table; customer 1 has a
completely non-skewed set of orders, while customer 10 has the most skew.
I've done this by creating 10000 products, each with a product code that has
been cast to a varchar and left-padded with '0's to 5 characters. Each
customer has the same number of rows in the sales orders table: customer 10
mostly orders products whose code, when cast to INT, is evenly divisible by
10, customer 2 mostly orders products divisible by 2, and so on. You get the
idea.
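
Very roughly, the data set is built along these lines. This is only a sketch
with illustrative table names, row counts and proportions; the attached
script is the authoritative version:

-- Sketch only; see the attached mbhj_patch_tests.sql for the actual script.
-- 10000 products with a zero-padded 5 character product code.
CREATE TABLE products (
    product_code varchar(5) PRIMARY KEY,
    description  text
);

INSERT INTO products
SELECT lpad(n::text, 5, '0'), 'product ' || n
FROM generate_series(1, 10000) AS n;

-- Sales orders: the same number of rows per customer. Customer 1 orders
-- products uniformly; customer c (c > 1) mostly orders products whose code,
-- cast to INT, is divisible by c, so the skew increases with the customer id.
CREATE TABLE sales (
    customer_id  int,
    product_code varchar(5)
);

INSERT INTO sales
SELECT c.c,
       lpad(CASE WHEN c.c > 1 AND random() < 0.9
                 THEN ((floor(random() * (10000 / c.c))::int + 1) * c.c)::text
                 ELSE (floor(random() * 10000)::int + 1)::text
            END, 5, '0')
FROM generate_series(1, 10) AS c(c),
     generate_series(1, 100000) AS r(r);

-- Keep the statistics current so the planner has up to date MCVs to work with.
ANALYZE products;
ANALYZE sales;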

Once I get this laptop sorted out, or get access to some better hardware, my
plan is to benchmark and chart the results for customers 1 to 10, both with
and without the patch. What I hope to show is that customer 1 performs almost
the same with the patch as without it, and hopefully to see a steady rise in
the performance gain as the customer number increases.
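
For each customer the benchmark query would then be something simple along
these lines (again only a sketch, using the same illustrative names as
above):

-- Run once per customer_id 1..10, with and without the patch, and compare
-- the timings; the Hash / Hash Join nodes in the plan show what happened.
EXPLAIN ANALYZE
SELECT count(*)
FROM sales s
JOIN products p ON p.product_code = s.product_code
WHERE s.customer_id = 10;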

Currently I'm unsure of the best way to ensure that the hash join goes into
more than one batch, apart from just making the data set very large.
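I suppose one option would be to just lower work_mem so the hash on the
products table can't fit in a single batch, something like the following, but
I'm not sure how representative that is:

-- Shrinking work_mem below the size of the hashed relation forces the hash
-- join to split into multiple batches; EXPLAIN ANALYZE reports the batch
-- count on the Hash node (Buckets / Batches / Memory Usage).
SET work_mem = '64kB';
EXPLAIN ANALYZE
SELECT count(*)
FROM sales s
JOIN products p ON p.product_code = s.product_code
WHERE s.customer_id = 10;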

Does anyone have any thoughts about the way I plan to go about benchmarking?

Please see the attached document for the benchmark script.

David.

Attachment: mbhj_patch_tests.sql (application/octet-stream, 3.2 KB)
