Re: Removing Functionally Dependent GROUP BY Columns

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Julien Rouhaud <julien(dot)rouhaud(at)dalibo(dot)com>, Marko Tiikkaja <marko(at)joh(dot)to>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Removing Functionally Dependent GROUP BY Columns
Date: 2016-01-23 09:44:08
Message-ID: CAKJS1f-2u-QZALRvFFf-H+HhxuakgBgLfAd4QiDWmeS8+3sPfA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 23 January 2016 at 12:44, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I spent some time looking through this but decided that it needs some work
> to be committable.
>
> The main thing I didn't like was that identify_key_vars seems to have a
> rather poorly chosen API: it mixes up identifying a rel's pkey with
> sorting through the GROUP BY list. I think it would be better to create
> a function that, given a relid, hands back the OID of the pkey constraint
> and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
> no pkey or it's deferrable). The column numbers should be offset by
> FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
> represented correctly --- compare RelationGetIndexAttrBitmap().

That seems like a much better design to me too. I've made changes and
attached the updated patch.

> The reason this seems like a more attractive solution is that the output
> of the pg_constraint lookup becomes independent of the particular query
> and thus is potentially cacheable (in the relcache, say). I do not think
> we need to cache it right now but I'd like to have the option to do so.

I've not touched that yet, but good idea.

> (I wonder BTW whether check_functional_grouping couldn't be refactored
> to use the output of such a function, too.)

I've attached a separate patch for that too. It applies after the
prunegroupby patch.

> Some lesser points:
>
> * I did not like where you plugged in the call to
> remove_useless_groupby_columns; there are weird interactions with grouping
> sets as to whether it will get called or not, and also that whole chunk
> of code is due for refactoring. I'd be inclined to call it from
> subquery_planner instead, maybe near where the HAVING clause preprocessing
> happens.

I've moved the call to subquery_planner()

> * You've got it iterating over every RTE, even those that aren't
> RTE_RELATION RTEs. It's pure luck that identify_key_vars doesn't fail
> outright when passed a bogus relid, and it's certainly wasteful.

:-( I've fixed that now.

> * Both of the loops iterating over the groupClause neglect to check
> varlevelsup, thus leading to assert failures or worse if an outer-level
> Var is present in the GROUP BY list. (I'm pretty sure outer Vars can
> still be present at this point, though I might be wrong.)

Fixed in the first loop, and the way I've rewritten the code to use
bms_difference, there's no need to check again in the 2nd loop.

> * I'd be inclined to see if the relvarlists couldn't be converted into
> bitmapsets too. Then the test to see if the pkey is a subset of the
> grouping columns becomes a simple and cheap bms_is_subset test. You don't
> really need surplusvars either, because you can instead test Vars for
> membership in the pkey bitmapsets that pg_constraint.c will be handing
> back.

I've changed to use a bitmapset now, but I've kept surplusvars, having
this allows a single pass over the group by clause to remove the
surplus Vars, rather than doing it again and again for each relation.
I've also found a better way so that array is only allocated if some
surplus Vars are found. I understand what you mean, and yes, it is
possible to get rid of it, but I'd need to still add something else to
mark that this rel's extra Vars are okay to be removed. I could do
that by adding another bitmapset and marking the relid in that, but I
quite like how I've changed it now as it saves having to check
varlevelsup again on the Vars in the GROUP BY clause, as I just use
bms_difference() on the originally found Vars subtracting off the PK
vars, and assume all of those are surplus.

> * What you did to join.sql/join.out seems a bit bizarre. The existing
> test case that you modified was meant to test something else, and
> conflating this behavior with the pre-existing one (and not documenting
> it) is confusing as can be. A bit more work on regression test cases
> seems indicated.

The history behind that is that at one point during developing the
patch that test had started failing due to the group by item being
removed therefore allowing the join removal conditions to be met. On
testing again with the old test query I see this no longer happens, so
I've removed the change, although the expected output still differs
due to the group by item being removed.

> I'm going to set this back to "Waiting on Author". I think the basic
> idea is good but it needs a rewrite.

Thanks for the review and the good ideas.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
prune_group_by_clause_ab4f491_2016-01-23.patch application/octet-stream 16.3 KB
check_functional_grouping_refactor.patch application/octet-stream 3.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2016-01-23 10:22:08 Re: More stable query plans via more predictable column statistics
Previous Message Amit Kapila 2016-01-23 07:48:17 Re: Releasing in September