Re: Removing Functionally Dependent GROUP BY Columns

From: Julien Rouhaud <julien(dot)rouhaud(at)dalibo(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Tiikkaja <marko(at)joh(dot)to>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Removing Functionally Dependent GROUP BY Columns
Date: 2016-01-23 11:56:57
Message-ID: 56A36A89.6050107@dalibo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 23/01/2016 10:44, David Rowley wrote:
> On 23 January 2016 at 12:44, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I spent some time looking through this but decided that it needs some work
>> to be committable.
>>
>> The main thing I didn't like was that identify_key_vars seems to have a
>> rather poorly chosen API: it mixes up identifying a rel's pkey with
>> sorting through the GROUP BY list. I think it would be better to create
>> a function that, given a relid, hands back the OID of the pkey constraint
>> and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
>> no pkey or it's deferrable). The column numbers should be offset by
>> FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
>> represented correctly --- compare RelationGetIndexAttrBitmap().
>
> That seems like a much better design to me too. I've made changes and
> attached the updated patch.
>
>> The reason this seems like a more attractive solution is that the output
>> of the pg_constraint lookup becomes independent of the particular query
>> and thus is potentially cacheable (in the relcache, say). I do not think
>> we need to cache it right now but I'd like to have the option to do so.
>
> I've not touched that yet, but good idea.
>
>> (I wonder BTW whether check_functional_grouping couldn't be refactored
>> to use the output of such a function, too.)
>
> I've attached a separate patch for that too. It applies after the
> prunegroupby patch.
>

This refactoring makes the code much more better, +1 for me.

I also reviewed it, it looks fine. However, the following removed
comment could be kept for clarity:

- /* The PK is a subset of grouping_columns, so we win */

>> Some lesser points:
>>
>> * I did not like where you plugged in the call to
>> remove_useless_groupby_columns; there are weird interactions with grouping
>> sets as to whether it will get called or not, and also that whole chunk
>> of code is due for refactoring. I'd be inclined to call it from
>> subquery_planner instead, maybe near where the HAVING clause preprocessing
>> happens.
>
> I've moved the call to subquery_planner()
>
>> * You've got it iterating over every RTE, even those that aren't
>> RTE_RELATION RTEs. It's pure luck that identify_key_vars doesn't fail
>> outright when passed a bogus relid, and it's certainly wasteful.
>
> :-( I've fixed that now.
>
>> * Both of the loops iterating over the groupClause neglect to check
>> varlevelsup, thus leading to assert failures or worse if an outer-level
>> Var is present in the GROUP BY list. (I'm pretty sure outer Vars can
>> still be present at this point, though I might be wrong.)
>
> Fixed in the first loop, and the way I've rewritten the code to use
> bms_difference, there's no need to check again in the 2nd loop.
>
>> * I'd be inclined to see if the relvarlists couldn't be converted into
>> bitmapsets too. Then the test to see if the pkey is a subset of the
>> grouping columns becomes a simple and cheap bms_is_subset test. You don't
>> really need surplusvars either, because you can instead test Vars for
>> membership in the pkey bitmapsets that pg_constraint.c will be handing
>> back.
>
> I've changed to use a bitmapset now, but I've kept surplusvars, having
> this allows a single pass over the group by clause to remove the
> surplus Vars, rather than doing it again and again for each relation.
> I've also found a better way so that array is only allocated if some
> surplus Vars are found. I understand what you mean, and yes, it is
> possible to get rid of it, but I'd need to still add something else to
> mark that this rel's extra Vars are okay to be removed. I could do
> that by adding another bitmapset and marking the relid in that, but I
> quite like how I've changed it now as it saves having to check
> varlevelsup again on the Vars in the GROUP BY clause, as I just use
> bms_difference() on the originally found Vars subtracting off the PK
> vars, and assume all of those are surplus.
>

I wonder if in remove_useless_groupby_columns(), in the foreach loop you
could change the

+ if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+ {

by something like

+ if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
+ continue;
+
+ if (bms_is_subset(pkattnos, relattnos))
+ {

which may be cheaper.

Otherwise, I couldn't find any issue with this new version of the patch.

>> * What you did to join.sql/join.out seems a bit bizarre. The existing
>> test case that you modified was meant to test something else, and
>> conflating this behavior with the pre-existing one (and not documenting
>> it) is confusing as can be. A bit more work on regression test cases
>> seems indicated.
>
> The history behind that is that at one point during developing the
> patch that test had started failing due to the group by item being
> removed therefore allowing the join removal conditions to be met. On
> testing again with the old test query I see this no longer happens, so
> I've removed the change, although the expected output still differs
> due to the group by item being removed.
>
>> I'm going to set this back to "Waiting on Author". I think the basic
>> idea is good but it needs a rewrite.
>
> Thanks for the review and the good ideas.
>

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2016-01-23 12:03:30 Re: Improved tab completion for FDW DDL
Previous Message Amit Kapila 2016-01-23 10:58:32 Re: Relation extension scalability