Potential Join Performance Issue

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Potential Join Performance Issue
Date: 2008-09-09 18:21:09
Message-ID: 6EEA43D22289484890D119821101B1DF2C1563@exchange20.mercury.ad.ubc.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

PostgreSQL development community:

Our research group has been using the PostgreSQL code base to test new
join algorithms. During testing, we noticed that the planner is not
pushing down projections to the outer relation in a hash join. Although
this makes sense for in-memory (1 batch) joins, for joins larger than
memory (such as for TPC-H DSS), this causes the system to perform
significantly more disk I/Os when reading/writing batches of the outer
relation.

A simple solution is to add a single line of code to
src\backend\optimizer\plan\createplan.c after line 1771:

disuse_physical_tlist(outer_plan, best_path->jpath.outerjoinpath);

This will always force the projection on the outer relation.

A more complicated modification alternative is to add a state variable
to allow the planner to know how many batches the hash join expects and
only push down the projection if it is greater than one. However,
pushing the projection on the outer relation is almost always the best
choice as it eliminates unneeded attributes for operators above the hash
join in the plan and will be robust in the case of poor estimates.

We have been testing using TPC-H scale factor 1 GB. A sample query that
demonstrates the behavior is:

SELECT c_custkey, c_name, o_orderkey, o_orderdate

FROM Customer, Orders

WHERE c_custkey = o_custkey

Note that EXPLAIN on this query will indicate that the projection is
performed on the outer relation even though it is not done. We found
the difference by modifying our code to track tuples and bytes output to
disk, but it also can be detected by watching the size of the temporary
files produced during the join.

Sincerely,

Dr. Ramon Lawrence

Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan

http://people.ok.ubc.ca/rlawrenc/

E-mail: ramon(dot)lawrence(at)ubc(dot)ca <mailto:ramon(dot)lawrence(at)ubc(dot)ca>

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zeugswetter Andreas OSB sIT 2008-09-09 18:59:58 Re: Synchronous Log Shipping Replication
Previous Message Alvaro Herrera 2008-09-09 17:58:45 Re: pg_regress inputdir