One-line Summary:
Better query optimization for "NOT IN" clause
Business Use-case:
Using "x NOT IN (SELECT y FROM target)" on extremely large tables could be made very fast. This might be necessary in order to introduce foreign keys where old systems relied on a user to type in a reference manually. Typically I am using NOT IN to find invalid foreign keys that were not protected by old database designs. More specifically, the following query returns what I am looking for:
  SELECT source, docno
  FROM tableA
  WHERE (source, docno) NOT IN (SELECT b.source, b.docno FROM tableB b)
This query took a few hours (tableB contains 1.8m rows and tableA 7.8m rows).
User impact with the change:
A search for invalid foreign keys can be reduced from hours or days to seconds. The reasoning behind this implementation is as follows: if I can return all rows from "SELECT col1, col2, ... FROM tableA ORDER BY col1, col2" in a few seconds, then it should be possible to calculate the missing / NOT IN (or existing / IN) values in a few seconds as well.
Implementation details:
I have written a general-purpose plpgsql function which returns a SETOF text[]. The function takes two table names with two sets of column names and, optionally, two filters on the respective tables. Unfortunately I had to cast values to text to make this work in a plpgsql function, but I got the results I needed.
I used my function as follows:
  SELECT DISTINCT a[1], a[2]
  FROM util.notIn('tableA', 'source, docno', 'tableB', 'source, docno', null, null) a
  ORDER BY a[1], a[2]
This function returned what I needed in 51 seconds, whereas the NOT IN statement ran for hours.
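The last two parameters accept extra predicates for each side. A hypothetical call using them (the 'ARCHIVE' filter string is invented for illustration and is not part of my schema):

```sql
-- Hypothetical: the same check restricted by the optional filters,
-- which are appended to each generated WHERE clause.
SELECT DISTINCT a[1], a[2]
FROM util.notIn('tableA', 'source, docno',
                'tableB', 'source, docno',
                'source <> ''ARCHIVE''',
                'source <> ''ARCHIVE''') a
ORDER BY a[1], a[2];
```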
Before including the plpgsql source, let me explain what it does: The function builds up two SQL statements using the given values (remember I did not specify any filter on either of the two tables - the last two parameters are null).
Resulting statements are:
Primary:
  SELECT DISTINCT ARRAY[source::text, docno::text] FROM tableA
  WHERE source IS NOT NULL AND docno IS NOT NULL
  ORDER BY ARRAY[source::text, docno::text]
Secondary:
  SELECT DISTINCT ARRAY[source::text, docno::text] FROM tableB
  WHERE source IS NOT NULL AND docno IS NOT NULL
  ORDER BY ARRAY[source::text, docno::text]
(As you can see, I stripped out NULLs, because those are easy to check for with a simple query.)
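The NULL check I refer to is just a plain query along these lines (column names taken from the example above):

```sql
-- Rows whose key is partly NULL are excluded from the merge
-- and can be listed separately:
SELECT source, docno
FROM tableA
WHERE source IS NULL OR docno IS NULL;
```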
Now I open two cursors and fetch the first record of both statements into separate variables. Since the FOUND variable is overwritten with every FETCH, I store the states in my own variables. Ordering the result sets is essential, since the merge logic of the function depends on it.
A loop now advances one or both of the two result sets based on whether the current rows are equal. If they are not equal, the primary row is returned when it sorts before the secondary row, and only the lesser of the two sets is advanced. You can terminate early once the primary dataset is exhausted; alternatively, if the secondary dataset reaches its end first, you can simply add all remaining rows from the primary dataset to the result.
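As a toy illustration of the merge (values invented for the example): with primary values 1, 2, 4 and secondary values 2, 3, the loop emits 1 (since 1 < 2), matches and skips 2, advances past 3, and finally emits 4 once the secondary set is exhausted - the same answer as:

```sql
SELECT a.v
FROM (VALUES (1), (2), (4)) a(v)
WHERE a.v NOT IN (SELECT b.v FROM (VALUES (2), (3)) b(v));
-- returns 1 and 4
```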
The best implementation for this would be to detect an IN / NOT IN WHERE clause during the query optimization phase and use sequential scans as done in the function. The important restriction is that the IN / NOT IN subselect must not depend on the current row of the main SELECT (i.e. it must not be correlated).
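For comparison, a NOT EXISTS form of the same check - which recent PostgreSQL planners can already execute as an anti join - is a possible workaround today, though it is not what this proposal is about:

```sql
-- Equivalent anti-join form of the tableA/tableB example above
-- (assuming source and docno are NOT NULL, as the function enforces):
SELECT a.source, a.docno
FROM tableA a
WHERE NOT EXISTS (SELECT 1
                  FROM tableB b
                  WHERE b.source = a.source
                    AND b.docno = a.docno);
```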
A real long-term solution would actually be to extend the SQL definition language to allow for a different type of join. To optimize IN, you could rewrite your query as an INNER JOIN (provided you join to a DISTINCT dataset). However, NOT IN is not so easy (except for a LEFT JOIN testing for NULL in the secondary table). What I would propose is the following:
SELECT a.colA
FROM TableA
[LEFT | RIGHT] MISSING JOIN TableB ON TableB.colB = TableA.colA
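For comparison, the existing LEFT JOIN workaround mentioned above would look like this for the tableA/tableB example (note the explicit NULL test it forces, and that its NULL handling differs from NOT IN when the key columns can be NULL):

```sql
-- LEFT JOIN anti-join rewrite of the NOT IN query:
SELECT a.source, a.docno
FROM tableA a
LEFT JOIN tableB b
       ON b.source = a.source
      AND b.docno = a.docno
WHERE b.source IS NULL;
```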
CREATE OR REPLACE FUNCTION util.notIn(pTableA text, pFieldA text,
    pTableB text, pFieldB text, pFilterA text, pFilterB text)
  RETURNS SETOF text[] AS
$BODY$
DECLARE
    vFieldsA  text[];
    vFieldsB  text[];
    vValueA   text[];
    vValueB   text[];
    vSQLA     text;
    vSQLB     text;
    vWhereA   text;
    vWhereB   text;
    vSelectA  text;
    vSelectB  text;
    vCursorA  refcursor;
    vCursorB  refcursor;
    vFirst    integer;
    vLast     integer;
    vNdx      integer;
    vMoreA    boolean;
    vMoreB    boolean;
BEGIN
    IF pTableA IS NULL OR pTableB IS NULL OR pFieldA IS NULL OR pFieldB IS NULL THEN
        RAISE EXCEPTION 'pTableA, pTableB, pFieldA and pFieldB parameters are required';
    END IF;
    vFieldsA := regexp_split_to_array(pFieldA, '[,]');
    vFieldsB := regexp_split_to_array(pFieldB, '[,]');
    IF array_length(vFieldsA, 1) <> array_length(vFieldsB, 1) THEN
        RAISE EXCEPTION 'pFieldA and pFieldB field lists must contain the same number of field names';
    END IF;
    vFirst := array_lower(vFieldsA, 1);
    vLast  := array_upper(vFieldsA, 1);

    -- Build matching SELECT lists and NOT NULL WHERE clauses for both tables.
    vWhereA  := ' WHERE ';
    vWhereB  := ' WHERE ';
    vSelectA := '';
    vSelectB := '';
    FOR vNdx IN vFirst .. vLast LOOP
        vFieldsA[vNdx] := trim(vFieldsA[vNdx]);
        vFieldsB[vNdx] := trim(vFieldsB[vNdx]);
        IF vNdx > vFirst THEN
            vSelectA := vSelectA || ', ';
            vSelectB := vSelectB || ', ';
            vWhereA  := vWhereA || ' AND ';
            vWhereB  := vWhereB || ' AND ';
        END IF;
        vSelectA := vSelectA || vFieldsA[vNdx] || '::text';
        vSelectB := vSelectB || vFieldsB[vNdx] || '::text';
        vWhereA  := vWhereA || vFieldsA[vNdx] || ' IS NOT NULL';
        vWhereB  := vWhereB || vFieldsB[vNdx] || ' IS NOT NULL';
    END LOOP;

    -- Apply the optional filters before building the final statements;
    -- appending them after the ORDER BY would produce invalid SQL.
    IF COALESCE(trim(pFilterA), '') <> '' THEN
        vWhereA := vWhereA || ' AND (' || pFilterA || ')';
    END IF;
    IF COALESCE(trim(pFilterB), '') <> '' THEN
        vWhereB := vWhereB || ' AND (' || pFilterB || ')';
    END IF;
    vSQLA := 'SELECT DISTINCT ARRAY[' || vSelectA || '] FROM ' || pTableA
          || vWhereA || ' ORDER BY ARRAY[' || vSelectA || ']';
    vSQLB := 'SELECT DISTINCT ARRAY[' || vSelectB || '] FROM ' || pTableB
          || vWhereB || ' ORDER BY ARRAY[' || vSelectB || ']';
    RAISE NOTICE 'Primary data: %', vSQLA;
    RAISE NOTICE 'Secondary data: %', vSQLB;

    OPEN vCursorA FOR EXECUTE vSQLA;
    OPEN vCursorB FOR EXECUTE vSQLB;
    FETCH vCursorA INTO vValueA;
    vMoreA := FOUND;
    FETCH vCursorB INTO vValueB;
    vMoreB := FOUND;

    -- Merge the two ordered streams, returning primary rows that have
    -- no match in the secondary stream.
    WHILE vMoreA LOOP
        IF vMoreB THEN
            IF vValueA = vValueB THEN
                FETCH vCursorA INTO vValueA;
                vMoreA := FOUND;
                FETCH vCursorB INTO vValueB;
                vMoreB := FOUND;
            ELSIF vValueA < vValueB THEN
                RETURN NEXT vValueA;
                FETCH vCursorA INTO vValueA;
                vMoreA := FOUND;
            ELSE
                FETCH vCursorB INTO vValueB;
                vMoreB := FOUND;
            END IF;
        ELSE
            -- Secondary exhausted: all remaining primary rows are missing.
            RETURN NEXT vValueA;
            FETCH vCursorA INTO vValueA;
            vMoreA := FOUND;
        END IF;
    END LOOP;
    CLOSE vCursorA;
    CLOSE vCursorB;
    RETURN;
END;
$BODY$
LANGUAGE plpgsql;
Estimated Development Time:
unknown
Opportunity Window Period:
none - PostgreSQL will have better performance whenever this is
added
Budget Money:
none - I have solved my problem already, but I think it would be a
great
improvement on PostgreSQL performance
Contact Information:
john@softco.co.za
Category:
[[Category:Proposals]]
[[Category:Performance]]