Re: Removing useless DISTINCT clauses

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Finnerty, Jim" <jfinnert(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Removing useless DISTINCT clauses
Date: 2018-08-24 03:12:12
Message-ID: CAKJS1f862Fh+v0Aw1=h7bOz0j1qBp-dozXrGOz7pkXVJ5BNx_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 24 August 2018 at 14:12, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> This is happening at the same time as some optimizations around GROUP
> BY, so either there's something different about what's happening there
> and I didn't appreciate it, or does that optimization suffer from a
> similar issue?

There are two separate optimisations in Jim's version of the patch:

1) Attempting to remove useless DISTINCT clause items in cases where a
true subset of the clauses make up the entire table's primary key.
This can be applied no matter how many joins are in the query. If a
table's PK is a,b then it's not going to be any more distinct if we
distinctify on a,b,c. Distification on a,b is the most we'll never
need. We already apply this optimisation to the GROUP BY clause.

2) Attempt to remove the entire DISTINCT clause if the query is to a
single base table when the DISTINCT clause contains the entire table's
PK. Here there's need uniquify the results.

I started this thread for #1 and Jim would like to add #2 as an
additional optimisation. I feel that #2 is different enough from #1
that they should be considered independently of each other. The area
where #2 should be implemented is far away at the other end of
planning from where #1 is being done. I believe that two separate
patches are the best way forward here as it does not really make sense
to reject #1 because there are concerns with #2, vice versa. I think
if Jim would like to propose #2, then I think a separate thread is a
better option than tagging along on this one. This might also reduce
confusion.

I'd also rather not turn this thread into people discussing/reviewing
Jim's work that he plans to implement in some fork taken from Postgres
10.5. I don't really think even this mailing list is the right place
for that, let alone this thread.

> > I've not read the patch, but David's reaction makes it sound like its
> > processing is done too early. There are right places and wrong places
> > to do most everything in the planner, and I do not wish to accept a
> > patch that does something in the wrong place.
>
> Right, I definitely agree with you there. This seemed like a reasonable
> place given the similar optimization (at least in appearance to me)
> being done there for the GROUP BY case. I'm happy to admit that I
> haven't looked at it in very much depth (hence my question to David) and
> I'm not an expert in this area, but I did want to bring up that the
> general idea and the relative trade-offs at least sounded reasonable.

You might be confusing #1 and #2. My concern is with #2. The existing
GROUP BY clause optimisation is almost identical to #1. I just wanted
to also apply it to the DISTINCT clause.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuro Yamada 2018-08-24 04:28:00 Re: Fix help option of contrib/oid2name
Previous Message Andres Freund 2018-08-24 02:50:43 Re: Pluggable Storage - Andres's take