Re: BUG #6669: unique index w/ multiple columns and NULLs

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: <pgsql-bugs(at)postgresql(dot)org>,"jo" <jose(dot)soares(at)sferacarta(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: BUG #6669: unique index w/ multiple columns and NULLs
Date: 2012-06-04 16:26:20
Message-ID: 4FCC9B5C020000250004801D@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> I get your point about COUNT(*) really counting rows, not values,
> but why doesn't GROUP BY then skip nulls?
>
> WITH null_test (col1, col2) AS
> (
> SELECT 1, null
> UNION ALL
> SELECT null, null
> )
> SELECT COUNT(*), col2 FROM null_test group by col2
> UNION ALL
> SELECT COUNT(col1), col2 FROM null_test group by col2;
>
> count | col2
> -------+------
> 2 |
> 1 |
> (2 rows)

I'm having trouble seeing why this result is confusing. You asked
for counts with a GROUP BY clause. In such a case, NULLs are a
group, since you might be interested in how many *are* null. Then
you did a count of all rows and a count of rows where another value
wasn't NULL. You got the only reasonable result, IMO. (Well,
unless you argue that a row with no known values should be an error
in the first place, which if I remember correctly is what E.F. Codd
argued for; but that is one point on which the standards committee
didn't go with Codd's position.)

> Looks like GROUP BY is selecting the NULL rows, then COUNT is
> processing them based on its rules.

I would tend to view it that COUNT is processing the rows it was
asked to process in each case, and GROUP BY is grouping them as
requested.

> I think the original complaint is that NULL != NULL in a WHERE
> clause, but GROUP BY is able to group them together just fine.

Whoa! I think I know what you meant, but that is a dangerously
misleading misstatement. It is not true that NULL = NULL, but
neither is it true that NULL != NULL. In fact, you also don't get
TRUE from NOT NULL = NULL. If you don't know either value, you
can't know that they are equal, and you can't know that they are
unequal. The results of such comparisons are UNKNOWN. That doesn't
mean you would always find the count of rows where the value is NULL
uninteresting; hence the IS NOT DISTINCT FROM concept is effectively
used for grouping.

Performing logical or set operations on data sets with missing
values is an inherently tricky business, but I think that overall
SQL has made reasonable choices on how to do that; my biggest gripe
is that there is no standard way to distinguish between UNKNOWN and
NOT APPLICABLE. The fuzziest areas seem to me to be related to
that deficiency. As long as NULL is not abused for such things as
"known to be zero" in an accounting record (which is subtly but
significantly different from "not applicable"), NULL is almost
always (IMO) better than some "magic value". If you have ever
converted data from a database where names were split into multiple
fields, and NULL was not allowed for middle name, you will probably
agree.

-Kevin

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Bruce Momjian 2012-06-04 16:36:37 Re: BUG #6669: unique index w/ multiple columns and NULLs
Previous Message Bruce Momjian 2012-06-04 15:56:23 Re: BUG #6669: unique index w/ multiple columns and NULLs