Re: [HACKERS] Removing useless DISTINCT clauses

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Removing useless DISTINCT clauses
Date: 2018-01-09 22:12:17
Message-ID: CAKJS1f8=KepAZ-UOYnqRfAdwOKWEvt=3vbe4_3QBCmQuHbYKmg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Tom,

Thanks for looking at this.

On 10 January 2018 at 09:26, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> This is a cute idea, but I'm troubled by a couple of points:
>
> 1. Once you don't have all the tlist items shown in DISTINCT, it really is
> more like DISTINCT ON, seems like. I am not sure it's a good idea to set
> hasDistinctOn, because that engages some planner behaviors we probably
> don't want, but I'm also not sure we can get away with just ignoring the
> difference. As an example, in allpaths.c there are assorted assumptions
> that having a distinctClause but !hasDistinctOn means all output columns
> of a subquery are listed in the distinctClause.

hmm, I see:

/*
* If subquery has regular DISTINCT (not DISTINCT ON), we're wasting our
* time: all its output columns must be used in the distinctClause.
*/
if (subquery->distinctClause && !subquery->hasDistinctOn)
return;

I think that particular one would just fail to remove columns from the
subquery if there's a distinct clause (not distinct on). That seems
more like a missed optimisation rather than a future bug. Although it
probably would be a good idea to get rid of the assumption that a
non-NIL distinctClause && !hasDistinctOn means that all output target
items are part of that distinct clause.

I don't see anything else in allpaths.c that would be affected.
Keeping in mind we'll never completely remove a distinctClause, just
possibly remove some items from the list. Everything else just seems
to be checking distinctClause != NIL, or is only doing something if
hasDistinctOn == true.

I'll do some more analysis on places that distinctClause is being used
to check what's safe.

> 2. There's a comment in planner.c to the effect that
>
> * When we have DISTINCT ON, we must sort by the more rigorous of
> * DISTINCT and ORDER BY, else it won't have the desired behavior.
> * Also, if we do have to do an explicit sort, we might as well use
> * the more rigorous ordering to avoid a second sort later. (Note
> * that the parser will have ensured that one clause is a prefix of
> * the other.)
>
> Removing random elements of the distinctClause will break its
> correspondence with the sortClause, with probably bad results.
>
> I do not remember for sure at the moment, but it may be that this
> correspondence is only important for the case of DISTINCT ON, in which
> case we could dodge the problem by not applying the optimization unless
> it's plain DISTINCT. That doesn't help us with point 1 though.

Yeah, it seems better to only try and apply this for plain DISTINCT.

> BTW, my dictionary says it's "dependent" not "dependant".

Oops. Thanks.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2018-01-09 22:14:27 Re: [HACKERS] Removing LEFT JOINs in more cases
Previous Message Tom Lane 2018-01-09 22:05:49