Re: [HACKERS] Removing useless DISTINCT clauses

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Removing useless DISTINCT clauses
Date: 2018-03-04 14:02:22
Message-ID: CAKJS1f9Z6-_xFFNuA5z5c7im3CbH_pBukzxVgo0sX=8zZjKhVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10 January 2018 at 09:26, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> 1. Once you don't have all the tlist items shown in DISTINCT, it really is
> more like DISTINCT ON, seems like. I am not sure it's a good idea to set
> hasDistinctOn, because that engages some planner behaviors we probably
> don't want, but I'm also not sure we can get away with just ignoring the
> difference. As an example, in allpaths.c there are assorted assumptions
> that having a distinctClause but !hasDistinctOn means all output columns
> of a subquery are listed in the distinctClause.

Looking through allpaths.c for usages of distinctClauses it seems
they're all either interested in distinctClauses being non-NIL, which
we'll have no effect on. There's one that only cares about distinctOn
only, and then there's:

/*
* If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
* time: all its output columns must be used in the distinctClause.
*/
if (subquery->distinctClause && !subquery->hasDistinctOn)
return;

I don't think removing this fast-path would currently have any benefit
as we're not setting the ressortgroupref for the targetlist items
which we've removed the distinctClause for.

If we did that then, you might think that the following code in
remove_unused_subquery_outputs() might be able to remove the
targetlist item for a useless distinct item we've removed:

/*
* If it has a sortgroupref number, it's used in some sort/group
* clause so we'd better not remove it. Also, don't remove any
* resjunk columns, since their reason for being has nothing to do
* with anybody reading the subquery's output. (It's likely that
* resjunk columns in a sub-SELECT would always have ressortgroupref
* set, but even if they don't, it seems imprudent to remove them.)
*/
if (tle->ressortgroupref || tle->resjunk)
continue;

but, it can't because we've not removed them yet.
remove_unused_subquery_outputs() is called before subquery_planner()
in set_subquery_pathlist(), so remove_useless_distinct_columns() won't
even have been called by the time we get to the above code. So it
looks like zeroing the ressortgroupref's of the TargetEntry items
belonging to the removed distinct items would be wasted effort. I also
can't imagine why we'd ever try to remove unused outputs from a
subquery that's already been planned since almost all of the benefits
are from allowing the planner more flexibility to do things like join
removal, so I don't think there's any danger of this code becoming
broken later.

I looked in other places in the planner which are looking at
distinctClauses. Many are just testing distinctClauses == NIL, which
are not affected here since we'll always leave at least 1 item in that
List.

Another place that looks interesting is query_is_distinct_for() in
analyzejoins.c. However, the subquery is being analyzed here long
before it's getting planned, so all the distinctClauses will be fully
intact at that time.

> 2. There's a comment in planner.c to the effect that
>
> * When we have DISTINCT ON, we must sort by the more rigorous of
> * DISTINCT and ORDER BY, else it won't have the desired behavior.
> * Also, if we do have to do an explicit sort, we might as well use
> * the more rigorous ordering to avoid a second sort later. (Note
> * that the parser will have ensured that one clause is a prefix of
> * the other.)
>
> Removing random elements of the distinctClause will break its
> correspondence with the sortClause, with probably bad results.
>
> I do not remember for sure at the moment, but it may be that this
> correspondence is only important for the case of DISTINCT ON, in which
> case we could dodge the problem by not applying the optimization unless
> it's plain DISTINCT. That doesn't help us with point 1 though.

Good point, although we could probably just apply the same
optimization to the ORDER BY to sidestep that. I've left a note in
that area as something to think about for another day, although I
think the refactoring done in this patch would make it a pretty easy
thing to fix.

> BTW, my dictionary says it's "dependent" not "dependant".

You're right, regardless if that dictionary says Webster's or Oxford
on the front. Fixed.

I've attached an updated patch.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
remove_useless_distinct_clauses_v3.patch application/octet-stream 12.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Banck 2018-03-04 14:24:18 Re: Online enabling of checksums
Previous Message Emre Hasegeli 2018-03-04 13:12:49 Re: constraint exclusion and nulls in IN (..) clause