<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<b>One-line Summary:</b><br>
Better query optimization for "NOT IN" clause<br>
<br>
<br>
<b><br>
</b><b>Business Use-case:</b><br>
Using x NOT IN (SELECT y FROM target) on extremely large tables can
be done very fast. This<br>
might be necessary in order to introduce foreign keys where old
systems relied on<br>
a user to type in a reference manually. Typically I am useing NOT IN
to find invalid<br>
foreign keys which was not protected by old database designs. More
specifically, the<br>
following query would return what I am looking for:<br>
SELECT source, docno<br>
FROM tableA<br>
WHERE (source, docno) NOT IN (SELECT b.source, b.docno FROM tableB
b)<br>
The query took a few hours (tableB contains 1.8m rows and tableA
7.8m rows)<br>
<br>
<br>
<br>
<b>User impact with the change:</b><br>
A search for invalid foreign keys can be reduced from hours or days
to seconds. The<br>
reasoning behind this implementation is as follows: If I can return
all rows<br>
from "SELECT col1, col2, ... FROM tableA ORDER BY col1, col2" in few
seconds,<br>
then calculating missing / NOT IN (or existing / IN) values in a few
seconds<br>
as well.<br>
<br>
<br>
<b><br>
</b><b>Implementation details:</b><br>
I have written a general purpose plpgsql function which returns an
SETOF text[]. The function<br>
takes two table names with two sets of column names and optionally
two filters on the<br>
different tables. Unfortunately I had to cast values to text to have
this work in<br>
a plpgsql function, but I got the results I needed.<br>
<br>
My required SQL:<br>
SELECT source, docno<br>
FROM tableA<br>
WHERE (source, docno) NOT IN (SELECT b.source, b.docno FROM tableB
b)<br>
<br>
The query took a few hours (tableB contains 1.8m rows and tableA
7.8m rows)<br>
<br>
I used my function as follows:<br>
SELECT DISTINCT a[1], a[2] FROM util.notIn('tableA', 'source,
docno', 'tableB', 'source, docno', null, null) a ORDER BY a[1], a[2]<br>
This function returned what I needed in 51 seconds, where the NOT IN
statement ran for hours.<br>
<br>
Before including the plpgsql source, let me explain what it does:<br>
The function builds up two SQL statements using the given values
(remember I did not specify<br>
any filter on either of the two tables - the last two parameters are
null).<br>
Resulting statements are:<br>
<br>
Primary: SELECT DISTINCT ARRAY[source::text, docno::text] FROM
tableA WHERE source IS NOT NULL AND docno IS NOT NULL ORDER BY
ARRAY[source::text, docno::text]<br>
Secondary: SELECT DISTINCT ARRAY[source::text, docno::text] FROM
tableB WHERE source IS NOT NULL AND docno IS NOT NULL ORDER BY
ARRAY[source::text, docno::text]<br>
<br>
(As you can see, I stripped out NULLs, because that I can check for
easily with a simple query)<br>
Now I open two cursors and fetch the first record of both statements
into separate variables. Since<br>
the FOUND variable is overwritten with every FETCH, I store the
states in my own variables.<br>
Ordering the results are important, since it is important for the
working of the function.<br>
A loop now advances both or one of the two result sets based on
whether the rows are equal or<br>
not (if not equal, return a row (if primary rows < secondary )
and only fetch from the dataset<br>
which is lesser of the two. You can early terminate when the primary
dataset end has been<br>
reached, alternatively if the secondary dataset reached its end, you
can simply add all remaining<br>
rows from the primary dataset to the result.<br>
<br>
The best implementation for this would be to detect a NOT IN (or IN)
WHERE clause during query optimization<br>
phase and use a sequential scan as done in the function. Important
is that the IN / NOT IN select statement<br>
must not be dependent on the current row of the main SELECT.<br>
<br>
A real long term solution would actually be to extend the SQL
definition language to allow for a different<br>
type of join. To optimize IN, you could rewrite your query as an
INNER JOIN (providing you you join to<br>
a DISTINCT dataset). However, the NOT IN is not so easy (except for
LEFT JOIN testing for NULL in secondary<br>
table). What I would propose is the following:<br>
SELECT a.colA<br>
FROM TableA<br>
[LEFT | RIGHT] MISSING JOIN TableB ON TableB.colB = TableA.colA<br>
<br>
CREATE OR REPLACE FUNCTION util.notIn(pTableA text, pFieldA text,
pTableB text, pFieldB text, pFilterA text, pFilterB text) RETURNS
SETOF text[] AS<br>
$BODY$<br>
DECLARE<br>
vFieldsA text[];<br>
vFieldsB text[];<br>
vValueA text[];<br>
vValueB text[];<br>
vRowA record;<br>
vRowB record;<br>
vSQLA text;<br>
vSQLB text;<br>
vWhereA text;<br>
vWhereB text;<br>
vSelectA text;<br>
vSelectB text;<br>
vCursorA refcursor;<br>
vCursorB refcursor;<br>
vFirst integer;<br>
vLast integer;<br>
vNdx integer;<br>
vMoreA boolean;<br>
vMoreB boolean;<br>
BEGIN<br>
IF pTableA IS NULL OR pTableB IS NULL OR pFieldA IS NULL OR
pFieldB IS NULL THEN<br>
RAISE EXCEPTION 'pTableA, pTableB, pFieldA and pFieldB
parameters are required';<br>
END IF;<br>
vFieldsA := regexp_split_to_array(pFieldA, '[,]');<br>
vFieldsB := regexp_split_to_array(pFieldB, '[,]');<br>
vFirst := array_lower(vFieldsA, 1);<br>
IF array_length(vFieldsA, 1) <> array_length(vFieldsB, 1)
THEN<br>
RAISE EXCEPTION 'pFieldA and pFieldB field lists must contain
the same number of field names';<br>
END IF;<br>
vLast := array_upper(vFieldsA, 1);<br>
vWhereA := ' WHERE ';<br>
vWhereB := ' WHERE ';<br>
vSelectA := '';<br>
vSelectB := '';<br>
FOR vNdx IN vFirst .. vLast LOOP<br>
vFieldsA[vNdx] := trim(vFieldsA[vNdx]);<br>
vFieldsB[vNdx] := trim(vFieldsB[vNdx]);<br>
IF vNdx > 1 THEN<br>
vSelectA := vSelectA || ', ';<br>
vSelectB := vSelectB || ', ';<br>
vWhereA := vWhereA || ' AND ';<br>
vWhereB := vWhereB || ' AND ';<br>
END IF;<br>
vSelectA := vSelectA || vFieldsA[vNdx] || '::text';<br>
vSelectB := vSelectB || vFieldsB[vNdx] || '::text';<br>
vWhereA := vWhereA || vFieldsA[vNdx] || ' IS NOT NULL';<br>
vWhereB := vWhereB || vFieldsB[vNdx] || ' IS NOT NULL';<br>
END LOOP;<br>
vSQLA := 'SELECT DISTINCT ARRAY[' || vSelectA || '] FROM ' ||
pTableA || vWhereA || ' ORDER BY ARRAY[' || vSelectA || ']';<br>
vSQLB := 'SELECT DISTINCT ARRAY[' || vSelectB || '] FROM ' ||
pTableB || vWhereB || ' ORDER BY ARRAY[' || vSelectB || ']';<br>
IF COALESCE(trim(pFilterA), '') <> '' THEN<br>
vSQLA := vSQLA || ' AND ' || pFilterA;<br>
END IF;<br>
IF COALESCE(trim(pFilterB), '') <> '' THEN<br>
vSQLB := vSQLB || ' AND ' || pFilterB;<br>
END IF;<br>
RAISE NOTICE 'Primary data: %', vSQLA;<br>
RAISE NOTICE 'Secondary data: %', vSQLB;<br>
OPEN vCursorA FOR EXECUTE vSQLA;<br>
OPEN vCursorB FOR EXECUTE vSQLB;<br>
FETCH vCursorA INTO vValueA;<br>
vMoreA := FOUND;<br>
FETCH vCursorB INTO vValueB;<br>
vMoreB := FOUND;<br>
WHILE vMoreA LOOP<br>
IF vMoreB THEN<br>
IF vValueA = vValueB THEN<br>
FETCH vCursorA INTO vValueA;<br>
vMoreA := FOUND;<br>
FETCH vCursorB INTO vValueB;<br>
vMoreB := FOUND;<br>
ELSEIF vValueA < vValueB THEN<br>
RETURN NEXT vValueA;<br>
FETCH vCursorA INTO vValueA;<br>
vMoreA := FOUND;<br>
ELSE<br>
FETCH vCursorB INTO vValueB;<br>
vMoreB := FOUND;<br>
END IF;<br>
ELSE<br>
RETURN NEXT vValueA;<br>
FETCH vCursorA INTO vValueA;<br>
vMoreA := FOUND;<br>
END IF;<br>
END LOOP;<br>
CLOSE vCursorA;<br>
CLOSE vCursorB;<br>
RETURN;<br>
END;<br>
$BODY$<br>
LANGUAGE plpgsql;<br>
<br>
<br>
<b><br>
</b><b>Estimated Development Time:</b><br>
unknown<br>
<br>
<br>
<b><br>
</b><b>Opportunity Window Period:</b><br>
none - PostgreSQL will have better performance whenever this is
added<br>
<br>
<br>
<br>
<b>Budget Money:</b><br>
none - I have solved my problem already, but I think it would be a
great<br>
improvement on PostgreSQL performance<br>
<br>
<b><br>
</b><b><br>
</b><b>Contact Information:</b><br>
<a class="moz-txt-link-abbreviated" href="mailto:john(at)softco(dot)co(dot)za">john(at)softco(dot)co(dot)za</a><br>
<br>
<br>
<b><br>
</b><b>Category:</b><br>
[[Category:Proposals]]<br>
[[Category:Performance]]<br>
<br>
</body>
</html>