Re: EXCEPT Queries

From: Peter Childs <blue(dot)dragon(at)blueyonder(dot)co(dot)uk>
To:
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: EXCEPT Queries
Date: 2003-02-20 08:29:17
Message-ID: Pine.LNX.4.44.0302200822130.5033-100000@RedDragon.Childs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Wed, 19 Feb 2003, Mark Mitchell wrote:

> I have a query that I must dedup using data in another table.
> This can be done quiet easily using an EXCEPT clause as long as both
> queries return the same result set, the problem with that is I only want
> to dedup based on a single columns value not dedup based on entire rows.
> The only other way I can see of doing this is using a NOT IN () clause,
> this seems to take much more time and resources than an EXCEPT
> statement. Is there any way to quickly dedup two tables based on only
> one row?
>
> Here is an example of what I'm currently doing.
>
> TABLE "A"
> "SUBSCRIBER_NAME" | "ACCOUNT_NUMBER"
> --------------------------------------
> BOB | 000001
> JOE | 000002
>
> TABLE "B"
> "SUBSCRIBER_NAME" | "ACCOUNT_NUMBER"
> --------------------------------------
> BOB | 000001
>
> To dedup table "A" using the data in table "B" I could use the
> following, except that the dedup takes place on the whole row when I
> only want it to take place on the "ACCOUNT_NUMBER" column.
>
> SELECT
> "A"."SUBSCRIBER_NAME" , "A"."ACCOUNT_NUMBER"
> FROM "A"
> EXCEPT
> SELECT
> "B"."SUBSCRIBER_NAME" , "B"."ACCOUNT_NUMBER"
> FROM "B"
>
>
> I could use this but it takes WAY to long when both result sets contain
> more than a few hundred records. It seems to increase exponentially the
> more records you add to each result set. The data sets I'm working with
> contain anywhere from 0 to 500000 records.
>
> SELECT
> "A"."SUBSCRIBER_NAME" , "A"."ACCOUNT_NUMBER"
> FROM "A"
> WHERE
> "A"."ACCOUNT_NUMBER"
> NOT IN
> (
> SELECT
> "B"."ACCOUNT_NUMBER"
> FROM "B"
> )
>

Big sub queries are a bad idea. Since we know where the number is
going to be why does this not work?

SELECT a.subscriber_name, a.account_number FROM a WHERE
EXISTS (SELECT b.account_number FROM b WHERE
b.account_number=a.account_number);

it should be a lot quicker if your account_number columns are
indexed.

I hope that helps

Peter Childs

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Stephen.Thompson 2003-02-20 10:09:42 Re: VIEW or Stored Proc - Is this even possible?
Previous Message Josh Berkus 2003-02-20 06:45:15 Re: EXCEPT Queries