Okay to tighten definition of oprcanhash?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Okay to tighten definition of oprcanhash?
Date: 2002-12-20 20:28:45
Message-ID: 24613.1040416125@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have been looking into the possibility of using a hashtable to speed
up "x IN (SELECT y FROM ...)" operations. Basically the idea is to run
the subselect once, loading its "y" outputs into an in-memory hashtable
(any duplicates can be discarded); then for each outer row, probe into
the hashtable to see if there's a match for the outer "x" value.

This relatively simple idea gets a lot messier as soon as you start to
consider the fine points of SQL92 "IN" semantics, however. In
particular, if x is null or any particular y is null, the result of the
IN should be null ("unknown"), not false. What the spec really says is:

1. If "x = y" yields TRUE for any row of the sub-select,
the IN yields TRUE.

2. If "x = y" yields FALSE for all rows of the sub-select,
or the sub-select contains no rows, the IN yields FALSE.

3. Otherwise (which by elimination means "x = y" yielded UNKNOWN
for at least one row, and TRUE nowhere), the IN yields UNKNOWN.

What we need to know about, therefore, is not so much whether x or y is
null as whether the "=" operator yields null.

Also, IN supports multi-column comparisons --- eg,
(x1, x2) IN (SELECT y1, y2 FROM ...)
So the x or y value could be partially null as well as completely null.

I think I see how to do this in what will usually be a fairly efficient
fashion, but it requires assuming a good deal about the behavior of the
"=" operator(s) involved. The actual implementation has to be:

1. Store all-not-null subquery rows in hashtable (combining equal rows
into a single entry). Store partially or completely null rows in a
separate linear list, combining non-distinct rows into a single entry.

2a. For LHS row all null: result is FALSE if table is empty, else UNKNOWN.

2b. For LHS row partially-null: must compare against every element of
both hashtable and linear list, using same logic as for scanning
original subquery output (three-valued combination of OR of ANDs).
We know that we will not get a TRUE result, so we can stop as soon as
we find an UNKNOWN result for any row; this says it's worthwhile to scan
the linear list first, as that's most likely to give us an UNKNOWN.

2c. For LHS all-non-null: probe into hashtable for equality; result is
TRUE if match found. Otherwise, must scan list of partially-null rows
to see if any non-distinct rows are found; if so return UNKNOWN, else
return FALSE.

This should be reasonably efficient because the slow cases only occur
when there are partially-null LHS or RHS rows, which should be
relatively uncommon. (In particular, when there is only one column to
compare, case 2b doesn't arise at all, and there can never be more than
one entry in the linear list, namely RHS = null.) Also, the number of
distinct partially null rows should usually be much less than the number
of distinct no-nulls rows, so the linear list should be much smaller
than the hashtable. (I tried to think of ways to hash this list, but it
doesn't seem to work...)

Now, the assumptions this makes about the behavior of the "="
operator(s) are:

1. The operators are hashable (ie, only values producing the same
hashkey could possibly be equal). Else we can't use a hashtable at all.

2. The operators are strict (must yield NULL out for NULL in). This is
what lets us avoid a table scan in case 2a.

3. The operators do not yield NULL for non-null inputs. Otherwise,
after case 2c fails to find a TRUE result, we'd have to scan the whole
hashtable, not only the partially-null list, to see if there are rows
that must cause us to return UNKNOWN.

We already have markers in pg_operator and pg_proc to indicate
hashability and strictness, but there's no marker to indicate "reverse
strictness" as per assumption #3 (what's the proper term for that
property, anyway?)

I am leaning towards documenting that the oprcanhash marker implies
"reverse strictness" as well. All the currently hashable operators meet
that requirement. Although you could imagine cases where you might like
"=" to return null for non-null inputs (for example, a rigorous
interpretation of IEEE float math would probably say that a NaN compared
to anything else yields UNKNOWN), it's not really a very practical thing
to do in SQL --- one counterexample is that such a datatype could not be
btree-indexable, since btree assumes a total ordering is possible.

The other thing we could do is put a separate "reverse strict" marker
column into pg_proc, but this seems like much more work than is
justified.

Comments?

(Oh BTW: I'm aware that most of the problem goes away for IN appearing
at the top level of WHERE, since there we don't actually need to
distinguish FALSE from UNKNOWN results. But I'm trying to understand
how to optimize the general case, too.)

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Diego T. 2002-12-20 20:41:34
Previous Message Mikheev, Vadim 2002-12-20 19:45:57 PostgreSQL-R