One-line Summary:
Better query optimization for "NOT IN" clause



Business Use-case:
Running x NOT IN (SELECT y FROM target) on extremely large tables could be made very fast. This
is often necessary in order to introduce foreign keys where old systems relied on
a user typing in a reference manually. Typically I am using NOT IN to find invalid
foreign keys that were not protected by old database designs. More specifically, the
following query returns what I am looking for:
SELECT source, docno
FROM tableA
WHERE (source, docno) NOT IN (SELECT b.source, b.docno FROM tableB b)
The query took a few hours (tableB contains 1.8m rows and tableA 7.8m rows)



User impact with the change:
A search for invalid foreign keys can be reduced from hours or days to seconds. The
reasoning behind this implementation is as follows: if I can return all rows
from "SELECT col1, col2, ... FROM tableA ORDER BY col1, col2" in a few seconds,
then I should be able to calculate missing / NOT IN (or existing / IN) values in a
few seconds as well.



Implementation details:
I have written a general-purpose plpgsql function which returns a SETOF text[]. The function
takes two table names with two sets of column names, and optionally a filter on each of the
two tables. Unfortunately I had to cast values to text to make this work in
a plpgsql function, but I got the results I needed.

I used my function as follows:
SELECT DISTINCT a[1], a[2]
FROM util.notIn('tableA', 'source, docno', 'tableB', 'source, docno', null, null) a
ORDER BY a[1], a[2]
This function returned what I needed in 51 seconds, where the NOT IN statement ran for hours.

Before including the plpgsql source, let me explain what it does:
The function builds two SQL statements from the given values (remember that I did not specify
a filter on either of the two tables - the last two parameters are null).
The resulting statements are:

Primary: SELECT DISTINCT ARRAY[source::text, docno::text] FROM tableA WHERE source IS NOT NULL AND docno IS NOT NULL ORDER BY ARRAY[source::text, docno::text]
Secondary: SELECT DISTINCT ARRAY[source::text, docno::text] FROM tableB WHERE source IS NOT NULL AND docno IS NOT NULL ORDER BY ARRAY[source::text, docno::text]

(As you can see, I stripped out NULLs, because those I can check for easily with a simple separate query.)
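Stripping the NULLs matters for more than convenience: under SQL's three-valued logic, x NOT IN (subquery) returns no rows at all as soon as the subquery produces even one NULL. A minimal illustration (sketched with Python's built-in sqlite3 purely for a self-contained demo; the semantics are the same in PostgreSQL, and the table names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a(v INTEGER);
    CREATE TABLE b(v INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (NULL);
""")

# NOT IN against a set containing NULL: every comparison with the NULL
# evaluates to "unknown", so no row can ever satisfy the predicate.
with_null = conn.execute(
    "SELECT v FROM a WHERE v NOT IN (SELECT v FROM b)").fetchall()
print(with_null)  # [] -- not the rows 2 and 3 one might expect

# Filtering out NULLs first (as the generated WHERE clause above does)
# restores the intuitive result.
stripped = conn.execute(
    "SELECT v FROM a WHERE v NOT IN "
    "(SELECT v FROM b WHERE v IS NOT NULL)").fetchall()
print(stripped)  # [(2,), (3,)]
```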
Now I open two cursors and fetch the first record of both statements into separate variables. Since
the FOUND variable is overwritten with every FETCH, I store the states in my own variables.
Ordering the results is essential; the function depends on both inputs being sorted.
A loop now advances one or both of the two result sets depending on whether the current rows are
equal. If they are not equal, a row is returned when the primary row sorts before the secondary row,
and only the cursor holding the lesser of the two rows is advanced. The loop can terminate early
once the primary dataset is exhausted; alternatively, if the secondary dataset is exhausted first,
all remaining rows from the primary dataset are simply added to the result.
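The loop just described is the classic sorted-merge set difference, which completes in a single pass over both sorted inputs. A minimal Python sketch of the same control flow (the names here are mine, not from the plpgsql function below):

```python
def not_in(primary, secondary):
    """Return rows of `primary` that are absent from `secondary`.

    Both inputs must be sorted and de-duplicated, mirroring the
    ORDER BY / DISTINCT cursors that the plpgsql function opens.
    """
    result = []
    a_iter, b_iter = iter(primary), iter(secondary)
    a = next(a_iter, None)
    b = next(b_iter, None)
    while a is not None:          # primary exhausted -> done (early termination)
        if b is None:             # secondary exhausted: rest of primary is missing
            result.append(a)
            a = next(a_iter, None)
        elif a == b:              # present in both: advance both cursors
            a = next(a_iter, None)
            b = next(b_iter, None)
        elif a < b:               # a can no longer match anything in secondary
            result.append(a)
            a = next(a_iter, None)
        else:                     # b < a: advance the secondary cursor only
            b = next(b_iter, None)
    return result

# ('source', 'docno') pairs, already sorted and de-duplicated:
table_a = [(1, 10), (1, 20), (2, 5), (3, 1)]
table_b = [(1, 10), (2, 5), (4, 9)]
print(not_in(table_a, table_b))  # [(1, 20), (3, 1)]
```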

The best implementation for this would be to detect a NOT IN (or IN) WHERE clause during the query
optimization phase and use a sequential scan as done in the function. It is important that the IN / NOT IN
subquery must not depend on the current row of the main SELECT, i.e. it must be uncorrelated.

A real long-term solution would actually be to extend the SQL definition language to allow for a different
type of join. To optimize IN, you could rewrite your query as an INNER JOIN (provided you join to
a DISTINCT dataset). NOT IN, however, is not so easy (except via a LEFT JOIN testing for NULL in the
secondary table). What I would propose is the following:
SELECT a.colA
FROM TableA
[LEFT | RIGHT] MISSING JOIN TableB ON TableB.colB = TableA.colA
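For reference, the two rewrites mentioned above are already expressible today; a sketch of both (again driven through sqlite3 only so the example is self-contained and runnable; the SQL itself is portable to PostgreSQL, and the table names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ta(colA INTEGER);
    CREATE TABLE tb(colB INTEGER);
    INSERT INTO ta VALUES (1), (2), (3);
    INSERT INTO tb VALUES (1), (3);
""")

# NOT EXISTS: an anti-join that is safe even if tb.colB contains NULLs.
not_exists = conn.execute("""
    SELECT colA FROM ta
    WHERE NOT EXISTS (SELECT 1 FROM tb WHERE tb.colB = ta.colA)
""").fetchall()

# LEFT JOIN probing for NULL: unmatched ta rows survive the filter.
left_join = conn.execute("""
    SELECT ta.colA FROM ta
    LEFT JOIN tb ON tb.colB = ta.colA
    WHERE tb.colB IS NULL
""").fetchall()

print(not_exists, left_join)  # [(2,)] [(2,)]
```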

CREATE OR REPLACE FUNCTION util.notIn(pTableA text, pFieldA text, pTableB text, pFieldB text, pFilterA text, pFilterB text) RETURNS SETOF text[] AS
$BODY$
DECLARE
  vFieldsA text[];
  vFieldsB text[];
  vValueA text[];
  vValueB text[];
  vRowA record;
  vRowB record;
  vSQLA text;
  vSQLB text;
  vWhereA text;
  vWhereB text;
  vSelectA text;
  vSelectB text;
  vCursorA refcursor;
  vCursorB refcursor;
  vFirst integer;
  vLast integer;
  vNdx integer;
  vMoreA boolean;
  vMoreB boolean;
BEGIN
  IF pTableA IS NULL OR pTableB IS NULL OR pFieldA IS NULL OR pFieldB IS NULL THEN
    RAISE EXCEPTION 'pTableA, pTableB, pFieldA and pFieldB parameters are required';
  END IF;
  vFieldsA := regexp_split_to_array(pFieldA, '[,]');
  vFieldsB := regexp_split_to_array(pFieldB, '[,]');
  vFirst := array_lower(vFieldsA, 1);
  IF array_length(vFieldsA, 1) <> array_length(vFieldsB, 1) THEN
    RAISE EXCEPTION 'pFieldA and pFieldB field lists must contain the same number of field names';
  END IF;
  vLast := array_upper(vFieldsA, 1);
  vWhereA := ' WHERE ';
  vWhereB := ' WHERE ';
  vSelectA := '';
  vSelectB := '';
  FOR vNdx IN vFirst .. vLast LOOP
    vFieldsA[vNdx] := trim(vFieldsA[vNdx]);
    vFieldsB[vNdx] := trim(vFieldsB[vNdx]);
    IF vNdx > 1 THEN
      vSelectA := vSelectA || ', ';
      vSelectB := vSelectB || ', ';
      vWhereA := vWhereA || ' AND ';
      vWhereB := vWhereB || ' AND ';
    END IF;
    vSelectA := vSelectA || vFieldsA[vNdx] || '::text';
    vSelectB := vSelectB || vFieldsB[vNdx] || '::text';
    vWhereA := vWhereA || vFieldsA[vNdx] || ' IS NOT NULL';
    vWhereB := vWhereB || vFieldsB[vNdx] || ' IS NOT NULL';
  END LOOP;
  IF COALESCE(trim(pFilterA), '') <> '' THEN
    vWhereA := vWhereA || ' AND ' || pFilterA;
  END IF;
  IF COALESCE(trim(pFilterB), '') <> '' THEN
    vWhereB := vWhereB || ' AND ' || pFilterB;
  END IF;
  -- Append the optional filters to the WHERE clause before adding ORDER BY;
  -- appending them afterwards would produce syntactically invalid SQL.
  vSQLA := 'SELECT DISTINCT ARRAY[' || vSelectA || '] FROM ' || pTableA || vWhereA || ' ORDER BY ARRAY[' || vSelectA || ']';
  vSQLB := 'SELECT DISTINCT ARRAY[' || vSelectB || '] FROM ' || pTableB || vWhereB || ' ORDER BY ARRAY[' || vSelectB || ']';
  RAISE NOTICE 'Primary data: %', vSQLA;
  RAISE NOTICE 'Secondary data: %', vSQLB;
  OPEN vCursorA FOR EXECUTE vSQLA;
  OPEN vCursorB FOR EXECUTE vSQLB;
  FETCH vCursorA INTO vValueA;
  vMoreA := FOUND;
  FETCH vCursorB INTO vValueB;
  vMoreB := FOUND;
  WHILE vMoreA LOOP
    IF vMoreB THEN
      IF vValueA = vValueB THEN
        FETCH vCursorA INTO vValueA;
        vMoreA := FOUND;
        FETCH vCursorB INTO vValueB;
        vMoreB := FOUND;
      ELSEIF vValueA < vValueB THEN
        RETURN NEXT vValueA;
        FETCH vCursorA INTO vValueA;
        vMoreA := FOUND;
      ELSE
        FETCH vCursorB INTO vValueB;
        vMoreB := FOUND;
      END IF;
    ELSE
      RETURN NEXT vValueA;
      FETCH vCursorA INTO vValueA;
      vMoreA := FOUND;
    END IF;
  END LOOP;
  CLOSE vCursorA;
  CLOSE vCursorB;
  RETURN;
END;
$BODY$
LANGUAGE plpgsql;



Estimated Development Time:
unknown



Opportunity Window Period:
none - PostgreSQL will have better performance whenever this is added



Budget Money:
none - I have solved my problem already, but I think it would be a great
improvement to PostgreSQL performance



Contact Information:
john@softco.co.za



Category:
[[Category:Proposals]]
[[Category:Performance]]