Re: Okay to tighten definition of oprcanhash?

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Okay to tighten definition of oprcanhash?
Date: 2002-12-20 22:09:10
Message-ID: 200212202209.gBKM9Bk22937@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


I like the idea, and see need to keep a separate list of NULL values.

---------------------------------------------------------------------------

Tom Lane wrote:
> I have been looking into the possibility of using a hashtable to speed
> up "x IN (SELECT y FROM ...)" operations. Basically the idea is to run
> the subselect once, loading its "y" outputs into an in-memory hashtable
> (any duplicates can be discarded); then for each outer row, probe into
> the hashtable to see if there's a match for the outer "x" value.
>
> This relatively simple idea gets a lot messier as soon as you start to
> consider the fine points of SQL92 "IN" semantics, however. In
> particular, if x is null or any particular y is null, the result of the
> IN should be null ("unknown"), not false. What the spec really says is:
>
> 1. If "x = y" yields TRUE for any row of the sub-select,
> the IN yields TRUE.
>
> 2. If "x = y" yields FALSE for all rows of the sub-select,
> or the sub-select contains no rows, the IN yields FALSE.
>
> 3. Otherwise (which by elimination means "x = y" yielded UNKNOWN
> for at least one row, and TRUE nowhere), the IN yields UNKNOWN.
>
> What we need to know about, therefore, is not so much whether x or y is
> null as whether the "=" operator yields null.
>
> Also, IN supports multi-column comparisons --- eg,
> (x1, x2) IN (SELECT y1, y2 FROM ...)
> So the x or y value could be partially null as well as completely null.
>
> I think I see how to do this in what will usually be a fairly efficient
> fashion, but it requires assuming a good deal about the behavior of the
> "=" operator(s) involved. The actual implementation has to be:
>
> 1. Store all-not-null subquery rows in hashtable (combining equal rows
> into a single entry). Store partially or completely null rows in a
> separate linear list, combining non-distinct rows into a single entry.
>
> 2a. For LHS row all null: result is FALSE if table is empty, else UNKNOWN.
>
> 2b. For LHS row partially-null: must compare against every element of
> both hashtable and linear list, using same logic as for scanning
> original subquery output (three-valued combination of OR of ANDs).
> We know that we will not get a TRUE result, so we can stop as soon as
> we find an UNKNOWN result for any row; this says it's worthwhile to scan
> the linear list first, as that's most likely to give us an UNKNOWN.
>
> 2c. For LHS all-non-null: probe into hashtable for equality; result is
> TRUE if match found. Otherwise, must scan list of partially-null rows
> to see if any non-distinct rows are found; if so return UNKNOWN, else
> return FALSE.
>
> This should be reasonably efficient because the slow cases only occur
> when there are partially-null LHS or RHS rows, which should be
> relatively uncommon. (In particular, when there is only one column to
> compare, case 2b doesn't arise at all, and there can never be more than
> one entry in the linear list, namely RHS = null.) Also, the number of
> distinct partially null rows should usually be much less than the number
> of distinct no-nulls rows, so the linear list should be much smaller
> than the hashtable. (I tried to think of ways to hash this list, but it
> doesn't seem to work...)
>
> Now, the assumptions this makes about the behavior of the "="
> operator(s) are:
>
> 1. The operators are hashable (ie, only values producing the same
> hashkey could possibly be equal). Else we can't use a hashtable at all.
>
> 2. The operators are strict (must yield NULL out for NULL in). This is
> what lets us avoid a table scan in case 2a.
>
> 3. The operators do not yield NULL for non-null inputs. Otherwise,
> after case 2c fails to find a TRUE result, we'd have to scan the whole
> hashtable, not only the partially-null list, to see if there are rows
> that must cause us to return UNKNOWN.
>
> We already have markers in pg_operator and pg_proc to indicate
> hashability and strictness, but there's no marker to indicate "reverse
> strictness" as per assumption #3 (what's the proper term for that
> property, anyway?)
>
> I am leaning towards documenting that the oprcanhash marker implies
> "reverse strictness" as well. All the currently hashable operators meet
> that requirement. Although you could imagine cases where you might like
> "=" to return null for non-null inputs (for example, a rigorous
> interpretation of IEEE float math would probably say that a NaN compared
> to anything else yields UNKNOWN), it's not really a very practical thing
> to do in SQL --- one counterexample is that such a datatype could not be
> btree-indexable, since btree assumes a total ordering is possible.
>
> The other thing we could do is put a separate "reverse strict" marker
> column into pg_proc, but this seems like much more work than is
> justified.
>
> Comments?
>
> (Oh BTW: I'm aware that most of the problem goes away for IN appearing
> at the top level of WHERE, since there we don't actually need to
> distinguish FALSE from UNKNOWN results. But I'm trying to understand
> how to optimize the general case, too.)
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ryan Mahoney 2002-12-20 22:20:52 plpgsql and index usage
Previous Message Tom Lane 2002-12-20 21:48:39 Re: Okay to tighten definition of oprcanhash?