Re: [PATCH] Erase the distinctClause if the result is unique by definition

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date: 2020-03-10 23:05:02
Message-ID: CAApHDvqg+mQyxJnCizE=qJcBL90L=oFXTFyiwWWEaUnzG7Uc5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 10 Mar 2020 at 21:19, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
>
>> SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part.
>
>
> Actually I want to keep the distinct for this case now. One reason is there are only 1
> row returned, so the distinct erasing can't help much. The more important reason is
> Query->hasAggs is true for "select distinct (select count(*) filter (where t2.c2 = 6
> and t2.c1 < 10) from ft1 t1 where t1.c1 = 6) from ft2 t2 where t2.c2 % 6 = 0 order by 1;"
> (this sql came from postgres_fdw.sql).

I think that sort of view is part of the problem here. If you want to
invent some new way to detect uniqueness that does not count that case
then we have more code with more possible places to have bugs.

FWIW, query_is_distinct_for() does detect that case with:

/*
* If we have no GROUP BY, but do have aggregates or HAVING, then the
* result is at most one row so it's surely unique, for any operators.
*/
if (query->hasAggs || query->havingQual)
return true;

which can be seen by the fact that the following find the unique join on t2.

postgres=# explain verbose select * from t1 inner join (select
count(*) c from t1) t2 on t1.a=t2.c;
QUERY PLAN
------------------------------------------------------------------------------------
Hash Join (cost=41.91..84.25 rows=13 width=12)
Output: t1.a, (count(*))
Inner Unique: true
Hash Cond: (t1.a = (count(*)))
-> Seq Scan on public.t1 (cost=0.00..35.50 rows=2550 width=4)
Output: t1.a
-> Hash (cost=41.89..41.89 rows=1 width=8)
Output: (count(*))
-> Aggregate (cost=41.88..41.88 rows=1 width=8)
Output: count(*)
-> Seq Scan on public.t1 t1_1 (cost=0.00..35.50
rows=2550 width=0)
Output: t1_1.a
(12 rows)

It will be very simple to add an empty List of UniqueKeys to the GROUP
BY's RelOptInfo to indicate that all expressions are unique. That way
any code that checks if some of the RelOptInfo's unique keys are a
subset of some expressions they'd like to know are unique, then
they'll get a match.

It does not really matter how much effort is saved in your example
above. The UniqueKey infrastructure won't know how much effort
properly adding all the uniquekeys will save. It should just add all
the keys it can and let whichever code cares about that reap the
benefits.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-03-10 23:16:18 Re: [PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit" setting is relatively small for large tables
Previous Message David Rowley 2020-03-10 22:48:48 Re: [PATCH] Erase the distinctClause if the result is unique by definition