Quick Links

Re: Planning for improved versions of IN/NOT IN

From:	Joe Conway <mail(at)joeconway(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: Planning for improved versions of IN/NOT IN
Date:	2002-11-30 04:32:23
Message-ID:	3DE83F57.10801@joeconway.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Tom Lane wrote:
> I've been thinking about how to improve the performance of queries using
> "WHERE x IN (subselect)" and "WHERE x NOT IN (subselect)".
>
> In the existing implementation, the subquery result is rescanned to look
> for a match to x each time the WHERE clause is executed; this essentially
> makes it work like a nestloop join of the stupidest variety. (We do
> stick a Materialize node atop the subselect if it looks complicated, but
> that's not a big help in typical cases.)
>
> I've thought of three alternative implementations that would perform
> better in various scenarios. Each would be relatively simple to
> implement; the problem I'm having is figuring out how to get the planner
> to choose the best one. The alternatives are basically:
>

[abbreviated]
> 1. Add FROM item
> 2. Hash-based
> 3. Inner indexscan
> 4. the existing implementation

> The difficulty is that it's not clear how to choose one of these four
> ways, short of planning the *entire* query from scratch all four ways :-(.
> This seems pretty grim. Approaches #2 and #3 could be handled as local
> transformations of the WHERE clause, but we couldn't choose which to use
> very well if we don't yet know how many outer rows the WHERE clause will
> be executed for. Approach #1 is really planning a completely different
> query --- one with an extra FROM-item --- and there's not going to be
> all that much commonality in the computations, unless we restrict where
> the added FROM-item can be joined to the others, which'd more or less
> defeat the purpose.
>
> Anyone see a way around this difficulty?

How about starting with a rule-based method to make the choice?

1. If uncorrelated: use hash-based approach - ISTM this might address a large
percentage of the problem cases -- it could even handle the
"IN (list-of-scalars)" case. Could it fall back to a
tuplesort/binary-search for the too many to hash in memory case?
2. If correlated: use an inner indexscan
3. If you come up with a pattern where none of the approaches produce a
correct answer, use the existing implementation

You could always get fancier later if needed, but something along these lines
would be a great start.

Joe

In response to

Planning for improved versions of IN/NOT IN at 2002-11-27 23:56:31 from Tom Lane

Responses

Re: Planning for improved versions of IN/NOT IN at 2002-11-30 04:42:29 from Mike Mascari

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Justin Clift	2002-11-30 04:32:54	Re: Postgres 7.3 announcement on postgresql.org
Previous Message	Tom Lane	2002-11-30 04:24:08	Re: Locale-dependent case conversion in {identifier}