Fixing the representation of ORDER BY/GROUP BY/DISTINCT

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Fixing the representation of ORDER BY/GROUP BY/DISTINCT
Date: 2008-08-01 00:54:49
Message-ID: 25811.1217552089@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

So while I was fooling with Steve Midgley's problem I got a bit of a bee
in my bonnet about the way that the parser emits ORDER BY, GROUP BY,
and DISTINCT lists.

* Currently, ORDER BY and DISTINCT use lists of SortClause, while GROUP
BY is a list of GroupClause --- but these are actually the same struct,
and there's a fair amount of code that relies on that. The advantage of
them being the same is that it's easy to compare them when considering
sort-and-uniq-style grouping plans. Except it's not easy enough: I
tried to use a list_member test in one place, and of course it didn't
work because equal() never sees distinct node tags as equal. So I'm
thinking we ought to unify the two node types logically as well as
physically, and just have one node type (SortGroupClause, maybe).

* The current representation is fine for ORDER BY but leaves something
to be desired for GROUP BY and DISTINCT: there isn't anyplace to specify
the equality operator for a hash-based grouping operation. This results
in repeat lookups in the planner to fetch information that was readily
available at parse time. But what's worse IMHO is that we simply cannot
represent a grouping query for a datatype that hasn't got a btree sort
opclass --- even though we could implement it, by means of hashing, if
the type has a hashable equality operator. (This isn't academic: it's
easy to imagine datatypes that have equality but no natural linear sort
order. A practical example is XID, which in fact has a hash opclass
but not a btree opclass, because it violates the law of transitivity.)
So what I'm thinking we want is something like

typedef struct SortGroupClause
{
NodeTag type;
Index tleSortGroupRef; /* reference into targetlist */
Oid eqop; /* the equality operator ('=' op) */
Oid sortop; /* the ordering operator ('<' op), or 0 */
bool nulls_first; /* do NULLs come before normal values? */
} SortGroupClause;

In an ORDER BY item the sortop and nulls_first flag *must* be supplied.
The eqop isn't really useful for ORDER BY, but it's trivial to get when
we get the sortop, and always including it simplifies comparisons to
GROUP/DISTINCT items. In GROUP BY/DISTINCT items, the eqop *must* be
supplied, and if it is a btree equality operator then the associated
sortop should be given as well. We'd continue the current practice of
duplicating the ordering properties of any ORDER BY clause given for the
same targetlist item, so that compatible ordering and grouping items are
just equal().

* Another thing I've gotten tired of is the difficulty of telling
DISTINCT from DISTINCT ON in the parsed representation. Surely we can
afford to stick another bool field into Query to make that distinction
unambiguous. This is important for making the world safe for hashed
DISTINCT, since AFAICS we probably can't ever use hashing for DISTINCT
ON --- its definition is too dependent on the assumption of sorting.

Barring objections, I'm going to go off and make this happen. It won't
immediately result in supporting hashed DISTINCT, but it's another
necessary step on the road to that.

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message ITAGAKI Takahiro 2008-08-01 02:53:25 Re: Copy storage parameters on CREATE TABLE LIKE/INHERITS
Previous Message Alvaro Herrera 2008-07-31 22:19:29 Re: Review: DTrace probes (merged version) ver_03