Proposal: Better query optimization for "NOT IN" clause

From: John Bester <john(at)softco(dot)co(dot)za>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Proposal: Better query optimization for "NOT IN" clause
Date: 2019-09-23 13:43:34
Message-ID: b1757396-a07f-2c3b-e358-0e8b1c0ff34a@softco.co.za
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

<html>
<head>

<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<b>One-line Summary:</b><br>
Better query optimization for "NOT IN" clause<br>
<br>
<br>
<b><br>
</b><b>Business Use-case:</b><br>
Using x NOT IN (SELECT y FROM target) on extremely large tables can
be done very fast. This<br>
might be necessary in order to introduce foreign keys where old
systems relied on<br>
a user to type in a reference manually. Typically I am useing NOT IN
to find invalid<br>
foreign keys which was not protected by old database designs. More
specifically, the<br>
following query would return what I am looking for:<br>
SELECT source, docno<br>
FROM tableA<br>
WHERE (source, docno) NOT IN (SELECT b.source, b.docno FROM tableB
b)<br>
The query took a few hours (tableB contains 1.8m rows and tableA
7.8m rows)<br>
<br>
<br>
<br>
<b>User impact with the change:</b><br>
A search for invalid foreign keys can be reduced from hours or days
to seconds. The<br>
reasoning behind this implementation is as follows: If I can return
all rows<br>
from "SELECT col1, col2, ... FROM tableA ORDER BY col1, col2" in few
seconds,<br>
then calculating missing / NOT IN (or existing / IN) values in a few
seconds<br>
as well.<br>
<br>
<br>
<b><br>
</b><b>Implementation details:</b><br>
I have written a general purpose plpgsql function which returns an
SETOF text[]. The function<br>
takes two table names with two sets of column names and optionally
two filters on the<br>
different tables. Unfortunately I had to cast values to text to have
this work in<br>
a plpgsql function, but I got the results I needed.<br>
<br>
My required SQL:<br>
SELECT source, docno<br>
FROM tableA<br>
WHERE (source, docno) NOT IN (SELECT b.source, b.docno FROM tableB
b)<br>
<br>
The query took a few hours (tableB contains 1.8m rows and tableA
7.8m rows)<br>
<br>
I used my function as follows:<br>
SELECT DISTINCT a[1], a[2] FROM util.notIn('tableA', 'source,
docno', 'tableB', 'source, docno', null, null) a ORDER BY a[1], a[2]<br>
This function returned what I needed in 51 seconds, where the NOT IN
statement ran for hours.<br>
<br>
Before including the plpgsql source, let me explain what it does:<br>
The function builds up two SQL statements using the given values
(remember I did not specify<br>
any filter on either of the two tables - the last two parameters are
null).<br>
Resulting statements are:<br>
<br>
Primary: SELECT DISTINCT ARRAY[source::text, docno::text] FROM
tableA WHERE source IS NOT NULL AND docno IS NOT NULL ORDER BY
ARRAY[source::text, docno::text]<br>
Secondary: SELECT DISTINCT ARRAY[source::text, docno::text] FROM
tableB WHERE source IS NOT NULL AND docno IS NOT NULL ORDER BY
ARRAY[source::text, docno::text]<br>
<br>
(As you can see, I stripped out NULLs, because that I can check for
easily with a simple query)<br>
Now I open two cursors and fetch the first record of both statements
into separate variables. Since<br>
the FOUND variable is overwritten with every FETCH, I store the
states in my own variables.<br>
Ordering the results are important, since it is important for the
working of the function.<br>
A loop now advances both or one of the two result sets based on
whether the rows are equal or<br>
not (if not equal, return a row (if primary rows &lt; secondary )
and only fetch from the dataset<br>
which is lesser of the two. You can early terminate when the primary
dataset end has been<br>
reached, alternatively if the secondary dataset reached its end, you
can simply add all remaining<br>
rows from the primary dataset to the result.<br>
<br>
The best implementation for this would be to detect a NOT IN (or IN)
WHERE clause during query optimization<br>
phase and use a sequential scan as done in the function. Important
is that the IN / NOT IN select statement<br>
must not be dependent on the current row of the main SELECT.<br>
<br>
A real long term solution would actually be to extend the SQL
definition language to allow for a different<br>
type of join. To optimize IN, you could rewrite your query as an
INNER JOIN (providing you you join to<br>
a DISTINCT dataset). However, the NOT IN is not so easy (except for
LEFT JOIN testing for NULL in secondary<br>
table). What I would propose is the following:<br>
SELECT a.colA<br>
FROM TableA<br>
[LEFT | RIGHT] MISSING JOIN TableB ON TableB.colB = TableA.colA<br>
<br>
CREATE OR REPLACE FUNCTION util.notIn(pTableA text, pFieldA text,
pTableB text, pFieldB text, pFilterA text, pFilterB text) RETURNS
SETOF text[] AS<br>
$BODY$<br>
DECLARE<br>
  vFieldsA text[];<br>
  vFieldsB text[];<br>
  vValueA text[];<br>
  vValueB text[];<br>
  vRowA record;<br>
  vRowB record;<br>
  vSQLA text;<br>
  vSQLB text;<br>
  vWhereA text;<br>
  vWhereB text;<br>
  vSelectA text;<br>
  vSelectB text;<br>
  vCursorA refcursor;<br>
  vCursorB refcursor;<br>
  vFirst integer;<br>
  vLast integer;<br>
  vNdx integer;<br>
  vMoreA boolean;<br>
  vMoreB boolean;<br>
BEGIN<br>
  IF pTableA IS NULL OR pTableB IS NULL OR pFieldA IS NULL OR
pFieldB IS NULL THEN<br>
    RAISE EXCEPTION 'pTableA, pTableB, pFieldA and pFieldB
parameters are required';<br>
  END IF;<br>
  vFieldsA := regexp_split_to_array(pFieldA, '[,]');<br>
  vFieldsB := regexp_split_to_array(pFieldB, '[,]');<br>
  vFirst := array_lower(vFieldsA, 1);<br>
  IF array_length(vFieldsA, 1) &lt;&gt; array_length(vFieldsB, 1)
THEN<br>
    RAISE EXCEPTION 'pFieldA and pFieldB field lists must contain
the same number of field names';<br>
  END IF;<br>
  vLast := array_upper(vFieldsA, 1);<br>
  vWhereA := ' WHERE ';<br>
  vWhereB := ' WHERE ';<br>
  vSelectA := '';<br>
  vSelectB := '';<br>
  FOR vNdx IN vFirst .. vLast LOOP<br>
    vFieldsA[vNdx] := trim(vFieldsA[vNdx]);<br>
    vFieldsB[vNdx] := trim(vFieldsB[vNdx]);<br>
    IF vNdx &gt; 1 THEN<br>
      vSelectA := vSelectA || ', ';<br>
      vSelectB := vSelectB || ', ';<br>
      vWhereA := vWhereA || ' AND ';<br>
      vWhereB := vWhereB || ' AND ';<br>
    END IF;<br>
    vSelectA := vSelectA || vFieldsA[vNdx] || '::text';<br>
    vSelectB := vSelectB || vFieldsB[vNdx] || '::text';<br>
    vWhereA := vWhereA || vFieldsA[vNdx] || ' IS NOT NULL';<br>
    vWhereB := vWhereB || vFieldsB[vNdx] || ' IS NOT NULL';<br>
  END LOOP;<br>
  vSQLA := 'SELECT DISTINCT ARRAY[' || vSelectA || '] FROM ' ||
pTableA || vWhereA || ' ORDER BY ARRAY[' || vSelectA || ']';<br>
  vSQLB := 'SELECT DISTINCT ARRAY[' || vSelectB || '] FROM ' ||
pTableB || vWhereB || ' ORDER BY ARRAY[' || vSelectB || ']';<br>
  IF COALESCE(trim(pFilterA), '') &lt;&gt; '' THEN<br>
    vSQLA := vSQLA || ' AND ' || pFilterA;<br>
  END IF;<br>
  IF COALESCE(trim(pFilterB), '') &lt;&gt; '' THEN<br>
    vSQLB := vSQLB || ' AND ' || pFilterB;<br>
  END IF;<br>
  RAISE NOTICE 'Primary data: %', vSQLA;<br>
  RAISE NOTICE 'Secondary data: %', vSQLB;<br>
  OPEN vCursorA FOR EXECUTE vSQLA;<br>
  OPEN vCursorB FOR EXECUTE vSQLB;<br>
  FETCH vCursorA INTO vValueA;<br>
  vMoreA := FOUND;<br>
  FETCH vCursorB INTO vValueB;<br>
  vMoreB := FOUND;<br>
  WHILE vMoreA LOOP<br>
    IF vMoreB THEN<br>
      IF vValueA = vValueB THEN<br>
        FETCH vCursorA INTO vValueA;<br>
        vMoreA := FOUND;<br>
        FETCH vCursorB INTO vValueB;<br>
        vMoreB := FOUND;<br>
      ELSEIF vValueA &lt; vValueB THEN<br>
        RETURN NEXT vValueA;<br>
        FETCH vCursorA INTO vValueA;<br>
        vMoreA := FOUND;<br>
      ELSE<br>
        FETCH vCursorB INTO vValueB;<br>
        vMoreB := FOUND;<br>
      END IF;<br>
    ELSE<br>
      RETURN NEXT vValueA;<br>
      FETCH vCursorA INTO vValueA;<br>
      vMoreA := FOUND;<br>
    END IF;<br>
  END LOOP;<br>
  CLOSE vCursorA;<br>
  CLOSE vCursorB;<br>
  RETURN;<br>
END;<br>
$BODY$<br>
LANGUAGE plpgsql;<br>
<br>
<br>
<b><br>
</b><b>Estimated Development Time:</b><br>
unknown<br>
<br>
<br>
<b><br>
</b><b>Opportunity Window Period:</b><br>
none - PostgreSQL will have better performance whenever this is
added<br>
<br>
<br>
<br>
<b>Budget Money:</b><br>
none - I have solved my problem already, but I think it would be a
great<br>
improvement on PostgreSQL performance<br>
<br>
<b><br>
</b><b><br>
</b><b>Contact Information:</b><br>
<a class="moz-txt-link-abbreviated" href="mailto:john(at)softco(dot)co(dot)za">john(at)softco(dot)co(dot)za</a><br>
<br>
<br>
<b><br>
</b><b>Category:</b><br>
[[Category:Proposals]]<br>
[[Category:Performance]]<br>
<br>
</body>
</html>

Attachment Content-Type Size
unknown_filename text/html 9.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Binguo Bao 2019-09-23 13:55:24 Re: [proposal] de-TOAST'ing using a iterator
Previous Message Robert Haas 2019-09-23 12:58:00 Re: Unwanted expression simplification in PG12b2