Re: Finding rows in table T1 that DO NOT MATCH any row in table T2

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Shaul Dar <shauldar(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Finding rows in table T1 that DO NOT MATCH any row in table T2
Date: 2009-10-21 18:00:37
Message-ID: C7049A55.14BED%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


On 10/21/09 4:52 AM, "Shaul Dar" <shauldar(at)gmail(dot)com> wrote:

> Tom,
>
> 1. Actually I just tested you suggestion
>
> SELECT COUNT (*) FROM T1 where NOT EXISTS
> (SELECT 1 FROM T2 where T1.PK <http://t1.pk/> = T2.FK <http://t2.fk/> )
>
> and in worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it
> took 1m8s,
>
> My suggestion, i.e.
>
> SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK <http://t1.pk/> = T2.FK
> <http://t2.fk/>
> WHERE T2.FK <http://t2.fk/> IS NULL
>
> was about twice as fast, 37s. (both returned same number of rows, about 2/3 of
> T1)
>
> However I can use DELETE with your version (instead of "SELECT COUNT (*)"
> above) but not with mine (can't have LEFT JOIN in DELETE), so YOU WIN. Thanks!
>
> 2. BTW. I presented my question earlier in an overly simplified fashion.
> Sorry. In actuality the two tables are joined on two columns,
> say Ka and Kb (a composite key column), e.g. T1.PKa = T2.FKa and T1.PKb =
> T2.FKb. So the IN versions suggested will not work
> since AFAIK IN only works for a single value.

The performance will stink in many cases, but IN and NOT IN can work on
multiple values, for example:

WHERE (a.key1, a.key2) NOT IN (SELECT b.key1, b.key2 FROM b).

The fastest (in 8.4) is definitely NOT EXISTS.

WHERE NOT EXISTS (SELECT 1 FROM b WHERE (b.key1, b.key2) = (a.key1, a.key2))

I've done this, deleting from tables with 15M + rows where I need a "not
in" on two or three columns on multiple other tables.
However, NOT EXISTS is only fast if every NOT EXISTS clause is a select on
one table, if it is multiple tables and a join, things can get ugly and the
planner might not optimize it right. In that case use two NOT EXISTS
clauses. Always look at the EXPLAIN plan.

With 8.4 -- for performance generally prefer the following:
* prefer JOIN and implicit joins to IN and EXISTS.
* prefer 'NOT EXISTS' to 'NOT IN' or 'LEFT JOIN where (right is null)'

>
> -- Shaul
>
> On Tue, Oct 20, 2009 at 3:59 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Shaul Dar <shauldar(at)gmail(dot)com> writes:
>>> I assume this will work but will take a long time:
>>
>>> DELETE * FROM T1 where T1.PK <http://T1.PK> NOT IN
>>> (SELECT T1.PK <http://T1.PK> FROM T1, T2 where T1.PK <http://T1.PK> =
>>> T2.FK <http://T2.FK> )
>>
>> Well, yeah, but it's unnecessarily inefficient --- why not just
>>
>> DELETE FROM T1 where T1.PK <http://T1.PK> NOT IN
>> (SELECT T2.FK <http://T2.FK> FROM T2)
>>
>> However, that still won't be tremendously fast unless the subselect fits
>> in work_mem.  As of 8.4 this variant should be reasonable:
>>
>> DELETE FROM T1 where NOT EXISTS
>> (SELECT 1 FROM T2 where T1.PK <http://T1.PK> = T2.FK <http://T2.FK> )
>>
>> Pre-8.4 you should resort to the "left join where is null" trick,
>> but there's no need to be so obscure as of 8.4.
>>
>>                        regards, tom lane
>
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2009-10-21 18:35:15 Re: Random penalties on GIN index updates?
Previous Message Jesper Krogh 2009-10-21 17:58:34 Re: Random penalties on GIN index updates?