Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project

From: Arun Chaitanya <chaitan64arun(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project
Date: 2012-03-30 11:33:48
Message-ID: CAFDHfCSjsoTZ-AQR-e7QwxwUXnCQzO_hNS-vZwQkR7pA31L3FA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I wanted to take up this as a GSOC 2012 project.

SQL supports nested queries. When the inner query contains a
correlation variable the present optimizer takes an iterative
execution plan. If the inner query scans over a relation, the
iterative plan chosen can be sub-optimal.

The goal of this project is to enable De-correlation for all possible
cases. The iterative execution plan can be converted to a set oriented
plan by taking a join over the base relations involved. Sufficient
work has already been done in this area and has been implemented in
SQL Server.

The changes required to incorporate the above mentioned strategy is in
rewriting phase to the best of my knowledge. The key idea is to
introduce the APPLY operator in the raw parse tree. In the above
mentioned Papers, the author has mentioned the process of removing the
apply. The author has proposed a set of rules which will allow us to
achieve the goal. The present postgresql optimizer community has done
some work in these lines for simple subqueries involving =, > , < in
the predicates [ I observed it by seeing the result of EXPLAIN for
relevant queries ]. The optimization is not done for subqueries
containing aggregate queries and existential and containment queries.

An example query from TPCH benchmark discussed by the author:

select c_custkey
from customer
where 1000000 < (select sum(o_totalprice)
from orders
where o_custkey = c_custkey)

In the above case, c_custkey is a correlation variable (parameterized)
coming from the outer query. Hence in the present system, the inner
query is executed as many times as the tuples in the customer
relation. As the subQuery involves a scan over orders relation, the
total I/O cost involved is pretty high.

Using the transformation proposed by the author, the query can be re-written as

select c_custkey
from customer left outer join
orders on o_custkey = c custkey
group by c_custkey
having 1000000 < sum(o_totalprice)

This allows the optimizer to chose a better plan for left outer join
and avoid iterative execution. The idea here is not to get a rewritten
query as output but to generate a plan tree that does the same as the
above query.

Regards,
Arun

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-03-30 12:04:16 Re: Optimizing Nested Correlated Queries by decorrelation: GSOC 2012 Project
Previous Message Dimitri Fontaine 2012-03-30 10:02:26 Re: Finer Extension dependencies