Improving NOT IN

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Improving NOT IN
Date: 2007-01-30 22:03:12
Message-ID: 1170194592.3681.271.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

It's a fairly common case to want to improve a query along the lines of
TableA intersect ~TableB.

We can write this as

select * from tableA
where key not in
(select * from tableB)

or we can get more fancy

select tableA.*
from tableA left outer join tableB
on tableA.key = tableB.key
where tableB is null;

I've worked out a new join method that will improve performance over and
above the second case. This is effective where the referenced table
(tableB) is fairly large and the join columns are discrete. Currently
that mostly means they're integers.

The plan seeks to *prove* that there are no matches, rather than taking
the exhaustive join approach taken currently.

First we need to show that the referenced table's PK values are a fully
continuous sequence of integers with no gaps. One this has been proved,
we can then use that fact to scan the FK table using the values of the
min and max PK to see if any outliers exist. There is no actual
comparison of the values, just a proof that none is required.

I'll describe this using SQL statements, which execute as SeqScans of
the PK and then FK tables. There is no preparatory step - no building a
sort table or preparing a hash table, so these SQL statements always
execute faster than the fastest current plan. Most importantly there is
no step that consumes large amounts of memory, so the case where two
tables are very large performs much, much better.

1. Scan referenced table
a) select max(aid), min(aid) from accounts;
b) select count(*) from accounts;
Sometimes this is faster using two queries when the table has a PK.

2. Decision Step
if max - min - count == 0 then we have a contiguous range and because we
know we have a discrete datatype we can now *prove* that there are no
missing values from the set bounded by the min and the max. We can then
use that directly in a new query:

3.
a) Scan referencing table
select aid from history where aid > ? or aid < ?;
using parameters of max and min from step 1

b) normal query

Step 1 & 2 can fail to find a contiguous range, in which case we need to
fall back to an existing query plan. So there is only small overhead in
the case where we run the first query but fail to use the optimisation
at all and need to fall back to existing query. We can estimate whether
this is the case by estimating the row count of the table and see if
that compares favourably with the expected number of values if the whole
range min-max of values is actually present. The min/max query uses the
Primary Key index (which must always be present) so takes very little
time.

So overall this looks like a win, in certain common cases, but not a
particular loss in any case.

Try this SQL

1. select max(aid), min(aid) from accounts;
2. select count(*) from accounts;
3. select aid from history where aid > (select max(aid) from accounts)
or aid < (select min(aid) from accounts) limit 1;

against

alter table history add foreign key (aid) references accounts;

I get (1) about 0.2secs (2) 6secs (3) 9secs against Alter Table 27secs

Using work_mem = 64MB and data that fits in memory

We could implement the new SQL directly within ALTER TABLE, or we could
actually create this as a new plan type that would then allow the
existing SQL to perform better in specific cases.

I've not seen such a plan on any other RDBMS and think it might be
completely new, which I'm calling a Proof Join, for want of a better
description. The preparatory steps are completely excluded, hence the x2
speedup. For larger referenced tables the performance improvement could
be much more.

ISTM that even though this is a special case it is actually a common
one, so would be worth optimising for.

Ideas stage at the moment: thoughts?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2007-01-30 22:03:42 Re: RI checks during UPDATEs
Previous Message Josh Berkus 2007-01-30 21:48:46 Re: Talks for OSCON? Only 5 days left!