Re: Improving non-joinable EXISTS subqueries

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgreSQL(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Improving non-joinable EXISTS subqueries
Date: 2008-08-19 14:43:30
Message-ID: 48AA95C2.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> The examples that Kevin Grittner put up awhile back included several
> uses of EXISTS() in places where it couldn't be turned into a
semijoin,
> eg in the query's targetlist. I was musing a bit about whether we
could
> improve those scenarios. I would like to get 8.4 to the point where
we
> could say as a blanket performance recommendation "prefer EXISTS
over
> IN". The semantic gotchas associated with NOT IN make it hard to
> optimize well, not to mention being a perennial bane of novices; so
if
> we could just point people in the other direction without
qualification
> I think we'd be better off.

Agreed.

> So ... I'm wondering if this actually touches anyone's hot-button,
> or if we should just file it in the overflowing pile of Things That
> Might Be Nice To Do Someday.
>
> Comments?

I'm in the position of trying to influence programmers here to write
queries using set logic. Way too many of the queries here are coded
with a cursor for a "primary" select, with a bunch of lower level
cursors to navigate around and get the related rows one at a time.
Results are often stuck into a work table as this progresses, with the
work table massaged a bit here and there in this procedural process,
and the final results selected out. It should surprise nobody here
that this is not fast to write, easy to maintain, efficient to run, or
generally free from subtle errors. I point out that they should write
queries which state what they want, regardless of how complex those
rules are, instead of writing how to get it. The optimizer, I argue,
has tricks available which they don't.

Usually, a rewrite into set logic has a fraction of the number of
lines, runs much faster, and loses a bug or two that was hidden within
the procedural spaghetti. On the other hand, sometimes they write a
perfectly good "set logic" query (from the point of view of stating
what they want), and the optimizer falls down, and I have to come in
and say "Oh, it has trouble with EXISTS; you can use IN here." When I
tell them to use IN instead of EXISTS, then I need to have all these
caveats about the sizes of tables and the possibilities of NULL on the
NOT EXISTS. At this point I tend to lose a big part of my audience.

So I'd be very happy to see this work done, not because I can't find a
workaround, but because trying to teach all the programmers tricky
hand-optimizations is a losing battle, and if I lose that battle the
queries degenerate into spaghetti-land.

As with others, I can't say where this fits on a priority list, but I
would hate to see it drift off onto a "someday" list.

-Kevin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-08-19 14:46:50 Re: Patch: plan invalidation vs stored procedures
Previous Message Tom Lane 2008-08-19 14:40:53 Re: Compatibility types, type aliases, and distinct types