Quick Links

Re: [PATCH] Erase the distinctClause if the result is unique by definition

From:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date:	2020-03-10 09:53:44
Message-ID:	CAExHW5tWqOEjhy8yOjPSk6=MYOGEtVaDsWmm+zyf69rUTHwL1g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Mar 10, 2020 at 3:51 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Sat, 7 Mar 2020 at 00:47, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> > Upload the newest patch so that the cfbot can pass. The last patch
> failed
> > because some explain without the (cost off).
>
> I've only really glanced at this patch, but I think we need to do this
> in a completely different way.
>
> I've been mentioning UniqueKeys around this mailing list for quite a
> while now [1]. To summarise the idea:
>
> 1. Add a new List field to RelOptInfo named unique_keys
> 2. During get_relation_info() process the base relation's unique
> indexes and add to the RelOptInfo's unique_keys list the indexed
> expressions from each unique index (this may need to be delayed until
> check_index_predicates() since predOK is only set there)
> 3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join
> rel itself (I've not looked for the best location in detail),
>

build_*_join_rel() will be a good place for this. The paths created might
take advantage of this information for costing.

> determine if the join can cause rows to be duplicated. If it can't,
> then add the UniqueKeys from that rel. For example: SELECT * FROM t1
> INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for
> {t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows
> since it's an eqijoin and t1.unique has a unique index).

this is interesting.

> If the
> condition was t1.unique = t2.unique then we could take the unique keys
> from both sides of the join, and with t1.non_unique = t2.non_unique,
> we can take neither.
> 4. When creating the GROUP BY paths (when there are no aggregates),
> don't bother doing anything if the input rel's unique keys are a
> subset of the GROUP BY clause. Otherwise, create the group by paths
> and tag the new unique keys onto the GROUP BY rel.
> 5. When creating the DISTINCT paths, don't bother if the input rel has
> unique keys are a subset of the distinct clause.
>

Thanks for laying this out in more details. Two more cases can be added to
this
6. When creating RelOptInfo for a grouped/aggregated result, if all the
columns of a group by clause are part of the result i.e. targetlist, the
columns in group by clause server as the unique keys of the result. So the
corresponding RelOptInfo can be marked as such.
7. The result of DISTINCT is unique for the columns contained in the
DISTINCT clause. Hence we can add those columns to the unique_key of the
RelOptInfo representing the result of the distinct clause.
8. If each partition of a partitioned table has a unique key with the same
columns in it and that happens to be superset of the partition key, then
the whole partitioned table gets that unique key as well.

With this we could actually pass the uniqueness information through
Subquery scans as well and the overall query will benefit with that.

>
> 4 and 5 will mean that: SELECT DISTINCT non_unique FROM t1 GROUP BY
> non_unique will just uniquify once for the GROUP BY and not for the
> distinct. SELECT DISTINCT unique FROM t1 GROUP BY unique; won't do
> anything to uniquify the results.
>
> Because both 4 and 5 require that the uniquekeys are a subset of the
> distinct/group by clause, an empty uniquekey set would mean that the
> RelOptInfo returns no more than 1 row. That would allow your:
>
> SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part.
>
> There's a separate effort in
> https://commitfest.postgresql.org/27/1741/ to implement some parts of
> the uniquekeys idea. However the implementation currently only covers
> adding the unique keys to Paths, not to RelOptInfos.
>

I haven't looked at that patch, but as discussed upthread, in this case we
want the uniqueness associated with the RelOptInfo and not the path.

>
> I also believe that the existing code in analyzejoins.c should be
> edited to make use of unique keys rather than how it looks at unique
> indexes and group by / distinct clauses.
>

+1.
--
Best Wishes,
Ashutosh Bapat

In response to

Re: [PATCH] Erase the distinctClause if the result is unique by definition at 2020-03-09 22:21:27 from David Rowley

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alexander Korotkov	2020-03-10 10:05:40	Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Previous Message	Peter Eisentraut	2020-03-10 09:35:51	Re: Remove utils/acl.h from catalog/objectaddress.h