Re: [SQL] DISTINCT ON: speak now or forever hold your peace

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Julian Scarfe <jscarfe(at)callnetuk(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org, pgsql-sql <pgsql-sql(at)postgreSQL(dot)org>
Subject: Re: [SQL] DISTINCT ON: speak now or forever hold your peace
Date: 2000-01-25 17:16:46
Message-ID: 13995.948820606@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-sql

Julian Scarfe <jscarfe(at)callnetuk(dot)com> writes:
>>> I use 'distinct on' to get the most recent reports for each of a group of
>>> locations. E.g.:
>>>
>>> select distinct on location * from reports where location ~~ 'Lond%'
>>> order by location, reporttime desc;
>>>
>>> to get the tuples that offer the most recent reports for each of London,
>>> Londonderry, Londy etc.
>>>
>>> Is there an alternative?
>>
>> select location, max(report_time) from reports group by location

> But I want report_text *too* -- that's the important bit, and it's not an
> aggregate or common column in the group.

Right. That is the motivation that was given for keeping DISTINCT ON
and trying to clean it up: if you ORDER BY the DISTINCT field plus
enough other fields to give a unique ordering, and then rely on the
(undocumented) implementation property that you'll get the first tuple
out of each group with the same DISTINCT field, then you can get useful
effects like this one.

It's a kluge, and it's not readily extensible to cases such as needing to
apply the DISTINCT filter across two fields, but it does solve problems
that you can't straightforwardly rewrite into GROUP BY.

Tell you what: there are two things that are really bugging me about
DISTINCT ON. If I could get consensus on patching these things,
I'd be willing to do that rather than yank it.

One gripe is the lame-brain syntax: because there is no separator
between the DISTINCT ON column name and the first select targetlist
item, the DISTINCT ON target cannot be anything more complex than an
unadorned column name. For example "SELECT DISTINCT ON A+B+C, ..."
must be interpreted as DISTINCT ON A with a first target of +B+C;
if we tried to allow more complex expressions then it becomes ambiguous
whether that was meant or DISTINCT ON A+B with a first target of +C.
To fix this we need some kind of syntactic separator. The cleanest
idea that comes to my mind is to require parentheses around the ON
target:
SELECT DISTINCT ON (expression) target1, ...
One immediate advantage of allowing an expression is that you can do
distinct-on-two-fields in a rather klugy way, eg
SELECT DISTINCT ON (field1 || ' ' || field2) ...
We might someday extend it to allow multiple DISTINCT fields, eg,
SELECT DISTINCT ON (expr1 [, expr2 ...]) target1, ...
but I'm not promising to do that now (unless it's really easy ;-)).

The other gripe is the bad interaction with ORDER BY, as was illustrated
in my example of last night: if you ORDER BY something that's not the
DISTINCT field, it doesn't work. We could get something that kind of
works by applying two sorting passes: first sort by the DISTINCT field,
then run the duplicate-elimination filter, then re-sort by the ORDER BY
field(s). The trouble with that is that it means the user no longer has
control of the sort order within the same-DISTINCT-field-value groups
during the first sort phase, so there's no way for him to control which
tuple gets kept from each group. What I'd prefer to do is put in an
error check that says "if you use both DISTINCT ON and ORDER BY, then
the DISTINCT ON expression must be the first ORDER BY item". That way
the ORDER BY ordering can be the same one used for the
duplicate-elimination pass, and we don't have the semantic problem.
Note that Julian's example meets this constraint.

(BTW, ordinary SELECT DISTINCT has this same sort of problem if you try
to ORDER BY an expression that doesn't appear in the target list.
SQL92 avoids the issue by not allowing you to ORDER BY expressions that
aren't in the target list, period. We do allow that --- but not when
you use DISTINCT. Essentially, I want to enforce that same restriction
for DISTINCT ON.)

The other piece of the puzzle would be to document that DISTINCT ON
keeps the first tuple out of each set with the same DISTINCT ON value.

Does that sound like a plan?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2000-01-25 17:23:15 Re: Happy column adding (was RE: [HACKERS] Happy column dropping)
Previous Message Ross J. Reedstrom 2000-01-25 16:37:11 Re: [HACKERS] Happy column dropping

Browse pgsql-sql by date

  From Date Subject
Next Message Tom Lane 2000-01-25 17:18:21 Re: [SQL] Problem with large tuples.
Previous Message Tom Lane 2000-01-25 16:26:12 Re: [SQL] Duplicate tuples with unique index