Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-12-18 05:39:16
Message-ID: 6EEA43D22289484890D119821101B1DF2C180E@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

Robert,

You do not need to use qgen.exe to generate queries, since you are not
running the TPC-H benchmark itself. Attached is a file with example
versions of the 22 sample TPC-H queries defined by the benchmark.

We have not tested using the TPC-H queries for this particular patch and
only use the TPC-H database as a large, skewed data set. The simpler
queries we test involve joins of Part-Lineitem or Supplier-Lineitem such
as:

Select * from part, lineitem where p_partkey = l_partkey

OR

Select count(*) from part, lineitem where p_partkey = l_partkey

The count(*) version is usually more useful for comparisons, because
with Select * the generation of output tuples on the client side (say
with pgAdmin) dominates the actual time to complete the query.

To isolate query costs, we also test using a simple server-side
function. I have also attached the setup description.
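
A minimal sketch of another way to keep client-side output out of the
measurement, assuming the TPC-H tables are already loaded: EXPLAIN
ANALYZE executes the query on the server and reports actual run times
without sending result rows to the client. This is not the server-side
function from the attached setup.txt, and the instrumentation adds some
overhead of its own, so such timings are best compared only against
other EXPLAIN ANALYZE runs.

```sql
-- Runs the join on the server, discards the output rows, and prints the
-- chosen plan with actual times for each node, including the join node.
EXPLAIN ANALYZE
SELECT count(*) FROM part, lineitem WHERE p_partkey = l_partkey;
```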

I would be happy to help in any way I can.

Bryce is currently working on an updated patch based on your
suggestions.

--
Dr. Ramon Lawrence
Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan
E-mail: ramon(dot)lawrence(at)ubc(dot)ca

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
> owner(at)postgresql(dot)org] On Behalf Of Robert Haas
> Sent: December 17, 2008 7:54 PM
> To: Lawrence, Ramon
> Cc: Tom Lane; pgsql-hackers(at)postgresql(dot)org; Bryce Cutt
> Subject: Re: [HACKERS] Proposed Patch to Improve Performance of Multi-
> Batch Hash Join for Skewed Data Sets
>
> Dr. Lawrence:
>
> I'm still working on reviewing this patch. I've managed to load the
> sample TPCH data from tpch1g1z.zip after changing the line endings to
> UNIX-style and chopping off the trailing vertical bars. (If anyone is
> interested, I have the results of pg_dump | bzip2 -9 on the resulting
> database, which I would be happy to upload if someone has server
> space. It is about 250MB.)
>
> But, I'm not sure quite what to do in terms of generating queries.
> TPCHSkew contains QGEN.EXE, but that seems to require that you provide
> template queries as input, and I'm not sure where to get the
> templates.
>
> Any suggestions?
>
> Thanks,
>
> ...Robert
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

Attachment Content-Type Size
test_queries.txt text/plain 13.1 KB
setup.txt text/plain 1.8 KB
