Fixing DISTINCT ON for duplicate keys

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Fixing DISTINCT ON for duplicate keys
Date: 2008-07-31 17:38:37
Message-ID: 16673.1217525917@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I looked into this trouble report:
http://archives.postgresql.org/pgsql-sql/2008-07/msg00123.php

The problem is that by the time we get to transformDistinctClause(),
any duplicate entries in the ORDER BY list have been eliminated
(see addTargetToSortList). But transformDistinctClause expects
a one-for-one match of the two lists, and so it complains.

Clearly, duplicate DISTINCT ON items are just as redundant as duplicate
ORDER BY items are, and so it seems that suppressing them is a
reasonable thing to do. But I'm thinking that as long as we're touching
this old code, there are some other things that should be fixed:

* There's not really any semantic significance to the ordering of the
DISTINCT ON list anyway, so it would be reasonable to rearrange the
ordering of the list to match the ORDER BY list, rather than making
the user do it.

* It's really bletcherous that the code physically modifies the
user-given ORDER BY. This damage is visible in stored rules ---
they don't come out the same way you wrote them. While I don't
mind the idea of dropping redundant entries, adding ORDER BY entries
that the user never wrote seems bogus. It overconstrains the query,
in a way that doesn't matter given our current implementation but
could matter in the future.

What I am thinking we could do about the latter is modify the querytree
semantics a bit. Instead of insisting that the transformed
distinctClause be equal to a prefix of the sortClause, allow either
one to be a prefix of the other. Then the planner simply takes the
longer one as its internal sort-order target. With this rule, the
sortClause stays as what the user wrote (less any duplicate keys).
The parser is required to remove any duplicate keys from the
distinctClause and rearrange it if needed so that it has a common
prefix with the sortClause (or throw error if this is not possible).
This would be invisible to the user in plain SELECT DISTINCT, and
in SELECT DISTINCT ON would mean that the list is dumped in a
"canonical" order that matches ORDER BY, but isn't changed in any
semantic way.

Now this is probably too big a change to be prudent to back-patch.
Is it worth coming up with a second patch that just tries to get
transformDistinctClause to remove duplicates? Since the problem
has existed for a very long time (at least back to 7.0 according
to my testing) with no prior reports, it doesn't seem very important
to fix. I'm a bit worried about putting a patch into only the
back branches --- it would go out with little testing and so the
odds of introducing a fresh problem seem uncomfortably high compared
to the benefit.

Comments?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David E. Wheeler 2008-07-31 17:42:00 Re: Type Categories for User-Defined Types
Previous Message Alvaro Herrera 2008-07-31 17:25:00 Re: pg_regress inputdir