Re: Removing Functionally Dependent GROUP BY Columns

From: Julien Rouhaud <julien(dot)rouhaud(at)dalibo(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Marko Tiikkaja <marko(at)joh(dot)to>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Removing Functionally Dependent GROUP BY Columns
Date: 2016-01-12 10:37:54
Message-ID: 5694D782.8010401@dalibo.com
Lists: pgsql-hackers

On 01/12/2015 12:07, David Rowley wrote:
> On 1 December 2015 at 17:09, Marko Tiikkaja <marko(at)joh(dot)to> wrote:
>
> On 2015-12-01 05:00, David Rowley wrote:
>
> We already allow a SELECT's target list to contain
> non-aggregated columns
> in a GROUP BY query in cases where the non-aggregated column is
> functionally dependent on the GROUP BY clause.
>
> For example a query such as;
>
> SELECT p.product_id,p.description, SUM(s.quantity)
> FROM product p
> INNER JOIN sale s ON p.product_id = s.product_id
> GROUP BY p.product_id;
>
> is perfectly fine in PostgreSQL, as p.description is
> functionally dependent
> on p.product_id (assuming product_id is the PRIMARY KEY of product).
>
>
> This has come up before (on other forums, at least), and my main
> concern has been that unlike the case where we go from throwing an
> error to allowing a query, this has a chance to make the planning of
> currently legal queries slower. Have you tried to measure the
> impact of this on queries where there's no runtime gains to be had?
>
>
> I've performed a series of benchmarks on the following queries:
>
> Test1: explain select id1,id2 from t1 group by id1,id2;
> Test2: explain select id from t2 group by id;
> Test3: explain select t1.id1,t1.id2 from t2 inner join t1 on
> t1.id1=t2.id group by t1.id1,t1.id2;
>
> I ran each of these with pgbench for 60 seconds, 3 runs per query. In
> each case below I've converted the TPS into seconds using the average
> TPS over the 3 runs.
>
> In summary:
>
> Test1 is the worst case test. It's a very simple query so planning
> overhead of join searching is non-existent. The fact that there's 2
> columns in the GROUP BY means that the fast path cannot be used. I added
> this because, if there's only 1 column in the GROUP BY, there's no point
> in searching for something to remove.
>
> [...]
>
> It seems that the worst case test adds about 7.6 microseconds onto
> planning time. To get this worst case result I had to add two GROUP BY
> columns, as having only 1 triggers a fast path as the code knows it
> can't remove any columns, since there's only 1. A similar fast path also
> exists which will only lookup the PRIMARY KEY details if there's more
> than 1 column per relation in the GROUP BY, so for example GROUP BY
> rel1.col1, rel2.col1 won't lookup any PRIMARY KEY constraint.
>
> Given that the extra code really only does anything if the GROUP BY has
> 2 or more expressions, are you worried that this will affect too many
> short and fast to execute queries negatively?
>
>

In identify_key_vars()

+ /* Only PK constraints are of interest for now, see comment above */
+ if (con->contype != CONSTRAINT_PRIMARY)
+ continue;

I think the comment should instead refer to
remove_useless_groupby_columns() (i.e. don't optimize further than what
check_functional_grouping() does).
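
Maybe something like this (just a suggestion):

+ /*
+ * Only PK constraints are of interest for now, see
+ * remove_useless_groupby_columns().
+ */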

+ /*
+ * Exit if the constraint is deferrable, there's no point in looking
+ * for another PK constriant, as there can only be one.
+ */

small typo on "constriant"

+ {
+ varlistmatches = bms_add_member(varlistmatches, varlistidx);
+ found_col = true;
+ /* don't break, there might be dupicates */
+ }

small typo on "dupicates". Also, I thought transformGroupClauseExpr()
called in the parse analysis make sure that there isn't any duplicate in
the GROUP BY clause. Do I miss something?

in remove_useless_groupby_columns()

+ /*
+ * Skip if there were no Vars found in the GROUP BY clause which belong
+ * to this relation. We can also skip if there's only 1 Var, as there
+ * needs to be more than one Var for there to be any columns which are
+ * functionally dependent on another column.
+ */

This comment doesn't sound very clear. I'm not a native English speaker,
so maybe it's just me.
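
Maybe something along these lines would be clearer (just a suggestion):

+ /*
+ * Skip this relation if fewer than two of its Vars appear in the
+ * GROUP BY clause: a column can only be removed if it's functionally
+ * dependent on other grouping columns of the same relation, so at
+ * least two Vars are needed.
+ */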

+ /*
+ * If we found any surplus Vars in the GROUP BY clause, then we'll build
+ * a new GROUP BY clause without these surplus Vars.
+ */
+ if (anysurplus)
+ {
+ List *new_groupby = NIL;
+
+ foreach (lc, root->parse->groupClause)
+ {
+ SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+ TargetEntry *tle;
+ Var *var;
+
+ tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+ var = (Var *) tle->expr;
+
+ if (!IsA(var, Var))
+ continue;
+ [...]

If var isn't a Var, it still needs to be added to new_groupby.
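Something like this maybe (untested sketch, reusing the existing variables):

+ if (!IsA(var, Var))
+ {
+     /* non-Var grouping expressions must be kept in the new GROUP BY */
+     new_groupby = lappend(new_groupby, sgc);
+     continue;
+ }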

+ /* XXX do we to alter tleSortGroupRef to remove gaps? */

no idea on that :/

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
