Re: [GENERAL] speeding up INTERSECT/EXCEPT

From: Mike Mascari <mascarim(at)yahoo(dot)com>
To: Carl Hauser <chauser(at)parc(dot)xerox(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] speeding up INTERSECT/EXCEPT
Date: 1999-07-25 22:22:49
Message-ID: 19990725222249.18643.rocketmail@web125.yahoomail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

--- Carl Hauser <chauser(at)parc(dot)xerox(dot)com> wrote:
> I've searched the archives and docs without finding
> any help on this.
>
> I want to use:
>
> select * from rel1 except select * from rel2;
>
> where rel1 and and rel2 each currently have about
> 2000 records and are likely to grow to twice that
> size or more.
>
> The query works but takes inordinately long. Looking
> at the query plan, it seems that the statement is
> being executed using a nested scan, so the query is
> taking time proportional to the product of the sizes
> of the two relations. The result set will usually be
> a few hundred records.
>
> My question: are there any indices I could define on
> the relations that would speed up the query? Since
> noting the problem, I've added an index on one of
> the fields (the same for both relations). Another
> index covers 7 of the fields in both relations (7
> because that is the maximum for an index). Creating
> these indices made no difference in the query plan
> nor in the execution time of the query.
>
> Should I pursue this approach further or do the
> differencing outside of the database?
>
> -- Carl Hauser
> -- Xerox Palo Alto Research Center
>
>

If there is a primary key on each table that could
be used as the oprands in a comparison operator,
then I think you might find a correlated subquery
significantly faster than EXCEPT using the EXISTS
clause. For example, I have two tables, supplies and
deletedsupplies, with the primary key being the supply
field:

SELECT * FROM supplies
WHERE NOT EXISTS (SELECT deletedsupplies.supply FROM
deletedsupplies WHERE deletedsupplies.supply =
supplies.supply)

Because I have an index on the deletedsupplies'
"supply" field, the plan performs a Sequential
scan on the supplies table, and, for each record,
performs an Index scan for the corresponding
record in the deletedsupplies table.

So, for your example, the query might be:

SELECT * FROM rel1
WHERE NOT EXISTS (SELECT * FROM rel2 WHERE rel2.key =
re1l.key)

with indexes on rel1.key and rel2.key. If the ENTIRE
record must match before it is excluded then you must
use a multi-field index and write the query like:

CREATE INDEX k_rel2 ON rel2(field1,field2,...)

SELECT * FROM rel1
WHERE NOT EXISTS (SELECT * FROM rel2 WHERE
(rel2.field1,rel2.field2,...) =
(rel1.field1,rel1.field2,...))

I have found that the EXISTS implementation is usually
good enough to yield the same functionality (with
superior performance) as intersect, except, and as
a poor-man's substitue for outer-joins...

Hope that helps,

Mike Mascari (mascarim(at)yahoo(dot)com)

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com

Browse pgsql-general by date

  From Date Subject
Next Message Dan Wilson 1999-07-26 02:30:53 Re: [GENERAL] pg-dump -- primary Key
Previous Message Mike Mascari 1999-07-25 15:27:32 Re: [GENERAL] Installation of postgresql-6.5.1 data missing ?