Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joe Conway" <joseph(dot)conway(at)home(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards
Date: 2001-06-23 20:58:37
Message-ID: 8714.993329917@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Joe Conway" <joseph(dot)conway(at)home(dot)com> writes:
> Attached is my initial attempt to teach clause_selectivity about BooleanTest
> nodes, for review and comment. Please let me know if I'm headed in the right
> direction.

Looks like you're headed the right way. A few comments ---

1. Don't forget to add a routine for NullTest nodes, too. These can be
simpler, since you know that all you care about is the stanullfrac; no
need to look at common values.

2. I don't see any real good reason to force booltestsel() (likewise
nulltestsel(), when you add it) to have the same call signature as the
various operator selectivity routines. Those have to be looked up in a
table and called via fmgr, but there's no reason to think that
booltestsel will ever be called in the same way. We could certainly
lose the useless "operator" argument. I'd be inclined to declare
booltestsel as

Selectivity booltestsel(Query *root,
BooleanTest *clause,
int varRelid);

and skip *all* the fmgr notational cruft. The only trouble with this is
we'd not want to put such a declaration into builtins.h, because it'd
force including a bunch more .h files into builtins.h, which is ugly
and might even cause some circularity problems. There are already some
selectivity-related things in builtins.h that don't really belong there.
I am starting to think that selfuncs.c deserves its own header file
separate from builtins.h, so that it doesn't clutter builtins.h with a
bunch of selectivity-specific stuff. (Comments anyone?)

3. You should be using tests like "if (DatumGetBool(values[i]))" or
"if (!DatumGetBool(values[i]))" to examine the MCV-list values, not
"if (values[i] == true)". The latter is a type cheat: it assumes that
Datum can be implicitly converted to bool, which is only sort-of true at
the moment and might be not-at-all-true in the future. You could write
"if (DatumGetBool(values[i]) == true)" and satisfy the strict-typing
part of this concern, but there's also a more stylistic thing here.
Personally I find "if (boolean == true)" or "if (boolean == false)"
to be poor style, it should be just "if (boolean)" or "if (!boolean)".
This gets back to whether you consider boolean to be a logically
distinct type from int, I suppose. Feel free to ignore that part
if you strongly disagree.

4. I don't think that DEFAULT_EQ_SEL is an appropriate default result
for this routine. Arguably, it should elog(ERROR) if not passed a
BooleanTest node, so the first one or two occurrences are nonissues
anyway. However, if you find a non-Var argument or a Var that you
can't get statistics for, you do need to return some sort of reasonable
default, and that I think is not the best. In any case the default should
take the test type into account. For IS_NULL and IS_UNKNOWN, I'd think
the default assumption should be small but not zero (0.01 maybe? Any
thoughts out there?). IS_NOT_NULL/UNKNOWN should obviously yield one
minus whatever we use for IS_NULL. It seems reasonable to use 0.5
as default for IS_TRUE and the other three BooleanTests.

5. Actually, for IS_TRUE and the other three BooleanTests, what you
should probably do with a non-Var input is ignore the possibility of a
NULL value and just try to estimate the probability of TRUE vs FALSE,
which you can do by recursing back to clause_selectivity() on the
argument (note this would be invalid for NullTest where the argument
isn't necessarily boolean, but it's legit for BooleanTests). Then
you use either that result or one minus that result, depending on
which BooleanTest you're dealing with.

6. The way you are working with the MCV values seems unnecessarily
complicated. I had a hard time seeing whether it was generating the
right answer (actually, it demonstrably isn't, since for example you
produce the same result for IS_TRUE and IS_NOT_FALSE, which ought to
differ by the stanullfrac amount; but it's too complex anyway).
I'd be inclined to do this in one of two ways:
1. Scan the MCV array to find the entry for "true", and work
with its frequency and the stanullfrac frequency to derive
the appropriate answer depending on the test type.
2. Always use the first (most common val's) frequency. This
value is either true or false, so adjust accordingly.
The simplest way would be to derive true's frequency as
either freq[0] or 1 - freq[0] - stanullfrac, and then
proceed as in #1.
#2 is a little more robust, since it would still work if for some reason
the MCV list contains only an entry for true or only an entry for false.
(I believe that could happen, if the column contains only one value and
possibly some nulls.)

You're off to an excellent start though; you seem to be finding your
way through the code very well.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hiroshi Inoue 2001-06-23 21:11:01 RE: Good name for new lock type for VACUUM?
Previous Message Joe Conway 2001-06-23 19:11:37 Re: AW: Re: [SQL] behavior of ' = NULL' vs. MySQL vs. Stand ards