From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Improving non-joinable EXISTS subqueries
Date: 2008-08-19 02:15:38
Message-ID: 10416.1219112138@sss.pgh.pa.us

The examples that Kevin Grittner put up a while back included several
uses of EXISTS() in places where it couldn't be turned into a semijoin,
eg in the query's targetlist. I was musing a bit about whether we could
improve those scenarios. I would like to get 8.4 to the point where we
could say as a blanket performance recommendation "prefer EXISTS over
IN". The semantic gotchas associated with NOT IN make it hard to
optimize well, not to mention being a perennial bane of novices; so if
we could just point people in the other direction without qualification
I think we'd be better off. But how much work would it be to get there?
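
(To illustrate the NOT IN gotcha I mean, using hypothetical tables
tab1(a) and tab2(z) --- this is the standard example, not one of
Kevin's queries:

SELECT * FROM tab2 WHERE z NOT IN (SELECT a FROM tab1);

returns no rows at all as soon as tab1.a contains even one NULL,
because "z <> NULL" is neither true nor false; whereas the NOT EXISTS
spelling

SELECT * FROM tab2 WHERE NOT EXISTS
  (SELECT 1 FROM tab1 WHERE tab1.a = tab2.z);

behaves the way the author almost certainly intended.)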

The single place where IN wins over EXISTS as of CVS HEAD is that a
non-join-optimizable IN clause can still be turned into a "hashed
subplan", which greatly reduces the cost of making IN tests for a large
number of upper-query rows. It looks to me that with the current
planning infrastructure it wouldn't be very difficult to turn EXISTS
(with hashable comparisons to upper variables in its WHERE) into a
similar kind of plan. The problem is that *that isn't necessarily a win*.
Consider something like

SELECT x, y, EXISTS(SELECT * FROM tab1 WHERE tab1.a = tab2.z)
FROM tab2 WHERE ...;

Given that there's an index on tab1.a, the current planning for this
will produce what's essentially a nestloop-with-inner-indexscan plan:
for each tab2 row selected by the outer query, we'll do an indexscan
probe into tab1 to see if there's a match. This is an ideal plan as
long as the outer query doesn't select very many tab2 rows.
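
For reference, that plan comes out looking roughly like this (an
illustrative sketch, not verbatim EXPLAIN output, assuming a
hypothetical index tab1_a_idx on tab1.a):

	Seq Scan on tab2
	  Filter: ...
	  SubPlan 1
	    ->  Index Scan using tab1_a_idx on tab1
	          Index Cond: (tab1.a = tab2.z)

with the SubPlan re-executed once per tab2 row the outer query emits.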

We could transform this into the equivalent of a hashed implementation
of

SELECT x, y, z IN (SELECT a FROM tab1)
FROM tab2 WHERE ...;

which would result in loading all of tab1 into a hashtable and then
probing the hashtable for each tab2 row. Now, that wins big if there
are many selected tab2 rows (and tab1 isn't too big to fit in an
in-memory hashtable). But it loses big if the outer query only needs
to probe for a few values --- we won't repay the cost of building
the hashtable.
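
To put rough numbers on it (purely hypothetical costs, chosen only to
show the shape of the tradeoff): suppose loading tab1 into the
hashtable costs 1000 units, each probe of that hashtable 0.01 units,
and each index probe into tab1 costs 1 unit. With N outer rows we are
comparing

	hashed subplan:       1000 + 0.01 * N
	per-row index probe:  1.00 * N

so the hashed version doesn't break even until N passes roughly a
thousand rows, and for a ten-row outer result it costs about a hundred
times as much.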

So I think it'd be a really bad idea to make this change blindly.
For everyone whose query got speeded up, someone else's would be slowed
down --- in fact, for apps that are tuned to PG's existing behavior,
you could expect that it'd mostly be the latter case.

The problem then is to make the choice of plan with some intelligence.
The bit of information that we lack in order to do that is an idea of
how many times the outer query will call the EXISTS subquery. Right
now, all subqueries that can't be pulled up as joins are planned fully
during SS_process_sublinks(), which is called long before we can have
any idea about that. I looked into whether it's feasible to postpone
planning subqueries till later on in planning. I think it's probably
structurally possible without an enormous amount of work, but it's not
exactly trivial either.

Even given that we postpone planning/costing subqueries until we really
need to know the cost, we're not out of the woods. For an EXISTS
appearing in a join condition, it's entirely possible that different
join sequences will result in executing the EXISTS wildly different
numbers of times. Re-planning the EXISTS subquery each time we consider
a different upper-query join sequence seems entirely impractical on
speed grounds.
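
(For concreteness, with a made-up schema: in something like

SELECT *
FROM t1, t2, t3
WHERE t1.id = t2.t1_id
  AND t1.id = t3.id AND t3.is_interesting
  AND EXISTS(SELECT 1 FROM tx WHERE tx.a = t1.x AND tx.b = t2.y);

the EXISTS clause can only be checked once t1 and t2 rows have been
brought together, so whether the restrictive t3 join happens before or
after that point changes the number of EXISTS executions by whatever
factor t3 filters out.)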

So it seems like what we'd need to do is

* During planner startup, generate Paths (we'd need no more detail
than that) for both the "retail" and hashed versions of each EXISTS
subquery. From these, estimate the startup cost of the hashed version
(ie, the time to execute the un-qualified subquery once and load the
hash table) and the per-upper-row costs of each approach. Stash these
costs somewhere handy.

* While forming upper-query paths, estimate the costs of each approach
on-the-fly for every path, based on the estimated number of rows in the
input paths. Use the cheaper case while figuring the cost of that
upper path.

* While building the final Plan, instantiate whichever subquery Plan
is a winner in the context of the chosen upper path.

I don't think any of this is out of reach, but it'd be a nontrivial
bit of implementation effort (maybe a week or three) and it also looks
like there might be a measurable planning slowdown for any query
involving subqueries. I'm not sure yet how much of this is just moving
existing work around and how much will actually represent new planning
expense. But certainly, costing EXISTS subqueries two different ways
isn't going to be free.

So ... I'm wondering if this actually touches anyone's hot-button,
or if we should just file it in the overflowing pile of Things That
Might Be Nice To Do Someday.

Comments?

regards, tom lane
