Re: Side effect of remove_useless_groupby_columns

From: Richard Guo <guofenglinux(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Side effect of remove_useless_groupby_columns
Date: 2021-03-01 06:36:33
Message-ID: CAMbWs49V5d6nz4KM=nQCSDgzHFo-0pA3+S1N_QUGUeB5NN+pnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Sun, Feb 28, 2021 at 5:15 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Sun, 28 Feb 2021 at 20:52, Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> > When looking at [1], I realized we may have a side effect when removing
> > redundant columns in the GROUP BY clause. Suppose we have a query with
> > ORDER BY 'b', and meanwhile column 'b' is also a group key. If we decide
> > that 'b' is redundant due to being functionally dependent on other GROUP
> > BY columns, we would remove it from group keys. This will make us lose
> > the opportunity to leverage the index on 'b'.
>
> > Any thoughts?
>
> I had initially thought that this was a non-issue before incremental
> sort was added, but the following case highlights that's wrong.
>
> # create table t (a int not null, b int);
> # insert into t select i, i%1000 from generate_series(1,1000000)i;
> # create index on t(b,a);
>
> # explain (analyze) select b from t group by a, b order by b,a limit 10;
> # alter table t add constraint t_pkey primary key (a);
> # explain (analyze) select b from t group by a, b order by b,a limit 10;
>
> Execution slows down after adding the primary key since that allows
> the functional dependency to be detected and the "b" column removed
> from the GROUP BY clause.
>
> Perhaps we could do something like add:
>
> diff --git a/src/backend/optimizer/plan/planner.c
> b/src/backend/optimizer/plan/planner.c
> index 545b56bcaf..7e52212a13 100644
> --- a/src/backend/optimizer/plan/planner.c
> +++ b/src/backend/optimizer/plan/planner.c
> @@ -3182,7 +3182,8 @@ remove_useless_groupby_columns(PlannerInfo *root)
> if (!IsA(var, Var) ||
> var->varlevelsup > 0 ||
> !bms_is_member(var->varattno -
> FirstLowInvalidHeapAttributeNumber,
> -
> surplusvars[var->varno]))
> +
> surplusvars[var->varno]) ||
> + list_member(parse->sortClause, sgc))
> new_groupby = lappend(new_groupby, sgc);
> }
>
> to remove_useless_groupby_columns(). It's not exactly perfect though
> since it's still good to remove the useless GROUP BY columns if
> there's no useful index to implement the ORDER BY. We just don't know
> that when we call remove_useless_groupby_columns().
>

Or, rather than thinking it as a side effect of removing useless groupby
columns, how about we do an additional optimization as 'adding useful
groupby columns', to add into group keys some column which matches ORDER
BY and meanwhile is being functionally dependent on other GROUP BY
columns. What I'm thinking is a scenario like below.

# create table t (a int primary key, b int, c int);
# insert into t select i, i%1000, i%1000 from generate_series(1,1000000)i;
# create index on t(b);

# explain (analyze) select b from t group by a, c order by b limit 10;

For this query, we can first remove 'c' from groupby columns with the
existing optimization of 'removing useless groupby columns', and then
add 'b' to groupby columns with the new-added optimization of 'adding
useful groupby columns'.

We can do that because we know 'b' is functionally dependent on 'a', and
we are required to be sorted by 'b', and meanwhile there is index on 'b'
that we can leverage.

Thanks
Richard

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2021-03-01 06:44:43 Re: [HACKERS] logical decoding of two-phase transactions
Previous Message Amit Langote 2021-03-01 06:31:23 Re: Parallel INSERT (INTO ... SELECT ...)