Re: pg, mysql comparison with "group by" clause

From: "Anthony Molinaro" <amolinaro(at)wgen(dot)net>
To: <gsstark(at)mit(dot)edu>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Scott Marlowe" <smarlowe(at)g2switchworks(dot)com>, "Stephan Szabo" <sszabo(at)megazone(dot)bigpanda(dot)com>, "Rick Schumeyer" <rschumeyer(at)ieee(dot)org>, <pgsql-sql(at)postgresql(dot)org>
Subject: Re: pg, mysql comparison with "group by" clause
Date: 2005-10-13 05:28:25
Message-ID: 3C6C2B281FD3E74C9F7C9D5B1EDA4582182615@wgexch01.wgenhq.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Greg,

Ok, I think I see where you're going (I don't agree, but I think
I get you now).

So, using your example of:
"dept_name is guaranteed to be the same for all records with the
same dept_id."

Here:

select d.deptno,d.dname
from emp e, dept d
where e.deptno=d.deptno

DEPTNO DNAME
------ --------------
10 ACCOUNTING
10 ACCOUNTING
10 ACCOUNTING
20 RESEARCH
20 RESEARCH
20 RESEARCH
20 RESEARCH
20 RESEARCH
30 SALES
30 SALES
30 SALES
30 SALES
30 SALES
30 SALES

ok, so there's your scenario.

And you're suggesting that one should be able to
Do the following query?

select d.deptno,d.dname,count(*)
from emp e, dept d
where e.deptno=d.deptno

if that's what you suggest, then we'll just have to agree to disagree.

That query needs a group by. What you're suggesting is, imo,
a wholly unnecessary shortcut (almost as bad as that ridiculous "natural
join" - whoever came up with that should be tarred and feathered).

I think I see your point now, I just disagree.
Your depending on syntax to work based on data integrity?
Hmmm.... don't think I like that idea

What performance improvement do you get from leaving group by out?
Look at the query above, doesn't a count of distinct deptno,dname pairs
have
to take place anyway? What do you save by excluding group by?
Are you suggesting COUNT be computed for each row (windowed) or that
COUNT is computed for each group?

If you want repeating rows, then you want windowing.
For example:

select d.deptno,d.dname,count(*)over(partition by d.deptno,d.dname) cnt
from emp e, dept d
where e.deptno=d.deptno

DEPTNO DNAME CNT
------ -------------- ---
10 ACCOUNTING 3
10 ACCOUNTING 3
10 ACCOUNTING 3
20 RESEARCH 5
20 RESEARCH 5
20 RESEARCH 5
20 RESEARCH 5
20 RESEARCH 5
30 SALES 6
30 SALES 6
30 SALES 6
30 SALES 6
30 SALES 6
30 SALES 6

if you want "groups", then use group by:

select d.deptno,d.dname,count(*) cnt
from emp e, dept d
where e.deptno=d.deptno
group by d.deptno,d.dname

DEPTNO DNAME CNT
------ -------------- ---
10 ACCOUNTING 3
20 RESEARCH 5
30 SALES 6

what your suggesting doesn't seem to fit in at all,
particularly when pg implements window functions.

If you're suggesting the pg optimizer isn't doing the right thing
with group by queries, then this is an optimizer issue and
that should be hacked, not group by. If you're suggesting certain
rows be ditched or shortcuts be taken, then the optimizer should do
that, not the programmer writing sql.

Db2 and oracle have no problem doing these queries, I don't see
why pg should have a problem.

imo, the only items that should not be listed in the group by
are:

1. constants and deterministic functions
2. scalar subqueries
3. window functions

1 - because the value is same for each row
2&3 - because they are evaluated after the grouping takes place

regards,
Anthony

-----Original Message-----
From: gsstark(at)mit(dot)edu [mailto:gsstark(at)mit(dot)edu]
Sent: Thursday, October 13, 2005 12:25 AM
To: Anthony Molinaro
Cc: gsstark(at)mit(dot)edu; Tom Lane; Scott Marlowe; Stephan Szabo; Rick
Schumeyer; pgsql-sql(at)postgresql(dot)org
Subject: Re: [SQL] pg, mysql comparison with "group by" clause

"Anthony Molinaro" <amolinaro(at)wgen(dot)net> writes:

> Greg,
> You'll have to pardon me...
>
> I saw this comment:
>
> "I don't see why you think people stumble on this by accident.
> I think it's actually an extremely common need."
>
> Which, if referring to the ability to have items in the select that do
not
> need to be included in the group, (excluding constants and the like)
is just
> silly.

Well the "constants and the like" are precisely the point. There are
plenty of
cases where adding the column to the GROUP BY is unnecessary and since
Postgres makes no attempt to prune them out, inefficient. And constants
aren't
the only such case. The most common case is columns that are coming from
a
table where the primary key is already included in the GROUP BY list.

In the case of columns coming from a table where the primary key is
already in
the GROUP BY list it's possible for the database to deduce that it's
unnecessary to group on that column.

But it's also possible to have cases where the programmer has out of
band
knowledge that it's unnecessary but the database doesn't have that
knowledge.
The most obvious case that comes to mind is a denormalized data model
that
includes a redundant column.

select dept_id, dept_name, count(*) from employee_list

For example if dept_name is guaranteed to be the same for all records
with the
same dept_id. Of course that's generally considered poor design but it
doesn't
mean there aren't thousands of databases out there with data models like
that.

--
greg

Browse pgsql-sql by date

  From Date Subject
Next Message NSO 2005-10-13 10:26:33 Storing images from Delphi to postgresql
Previous Message Tom Lane 2005-10-13 04:39:10 Re: UPDATE Trigger on multiple tables