Re: "SELECT .. WHERE NOT IN" query running for hours

From: Γιωργος Βαλκανας <lebiathan(at)gmail(dot)com>
To: Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: "SELECT .. WHERE NOT IN" query running for hours
Date: 2011-01-07 09:29:32
Message-ID: AANLkTimCc6U-xHigzNcjF3xABadg-yKV_1+kTqXRXhnL@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Fair enough!

I also turned seqscan off, so the new plan (for the NOT EXISTS) is:

Merge Anti Join (cost=0.00..212686.89 rows=1 width=313) (actual
time=0.426..14921.344 rows=63836 loops=1)
Merge Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
-> Index Scan using hwdocid2_uniq on "Doc2" d2 (cost=0.00..19442.87
rows=96454 width=313) (actual time=0.130..1248.783 rows=96454 loops=1)
-> Index Scan using hwdocid_uniq on "Document" d (cost=0.00..189665.17
rows=949272 width=12) (actual time=0.085..11158.740 rows=948336 loops=1)
Total runtime: 15062.925 ms

Hmm.. doesn't really seem to be such a great boost on performance. But i
guess I'll be sticking to this one.

So my follow-up question on the subject is this:

Are there any particular semantics for the "NOT IN" statement that cause the
correlated query to execute for every row of the outter query, as opposed to
the "NOT EXISTS" ? Or are there any other practical reasons, related to "IN
/ NOT IN", for this to be happening? Or is it simply due to implementation
details of each RDBMS? I guess the former (or the 2nd one), since, as you
say, this is common in most databases, but I would most appreciate an answer
to clarify this.

Thanks again!

Best regards,
George

2011/1/7 Mladen Gogala <mladen(dot)gogala(at)vmsinfo(dot)com>

> On 1/6/2011 9:36 PM, Γιωργος Βαλκανας wrote:
>
>>
>> 1) Why is it taking *so* long for the first query (with the "NOT IN" ) to
>> do even the simple select?
>>
> Because NOT IN has to execute the correlated subquery for every row and
> then check whether the requested value is in the result set, usually by
> doing sequential comparison. The NOT EXIST plan is also bad because there is
> no index but at least it can use very fast and efficient hash algorithm.
> Indexing the "hwdocid" column on the "Document" table or, ideally, making it
> a primary key, should provide an additional boost to your query. If you
> already do have an index, you may consider using enable_seqscan=false for
> this session, so that the "hwdocid" index will be used. It's a common wisdom
> that in the most cases NOT EXISTS will beat NOT IN. That is so all over the
> database world. I've seen that in Oracle applications, MS SQL applications
> and, of course MySQL applications. Optimizing queries is far from trivial.
>
> Μλαδεν Γογαλα
>
> --
> Mladen Gogala
> Sr. Oracle DBA
> 1500 Broadway
> New York, NY 10036
> (212) 329-5251
> www.vmsinfo.com
>
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Florian Weimer 2011-01-07 12:45:25 Wrong docs on checkpoint_segments?
Previous Message marc.hsiao 2011-01-07 08:14:41 Re: How to turn autovacuum prevent wrap around run faster?