Re: [PATCH] Erase the distinctClause if the result is unique by definition

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date: 2020-02-07 15:54:27
Message-ID: CAExHW5ueQDkLJrEqGh6LS_9TEEH2GF3JDz3qnM0L9bHM-7F7bQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andy,
What might help is to add more description to your email message like
giving examples to explain your idea.

Anyway, I looked at the testcases you added for examples.
+create table select_distinct_a(a int, b char(20), c char(20) not null, d
int, e int, primary key(a, b));
+set enable_mergejoin to off;
+set enable_hashjoin to off;
+-- no node for distinct.
+explain (costs off) select distinct * from select_distinct_a;
+ QUERY PLAN
+-------------------------------
+ Seq Scan on select_distinct_a
+(1 row)

From this example, it seems that the distinct operation can be dropped
because (a, b) is a primary key. Is my understanding correct?

I like the idea since it eliminates one expensive operation.

However the patch as presented has some problems
1. What happens if the primary key constraint or NOT NULL constraint gets
dropped between a prepare and execute? The plan will no more be valid and
thus execution may produce non-distinct results. PostgreSQL has similar
concept of allowing non-grouping expression as part of targetlist when
those expressions can be proved to be functionally dependent on the GROUP
BY clause. See check_functional_grouping() and its caller. I think,
DISTINCT elimination should work on similar lines.
2. For the same reason described in check_functional_grouping(), using
unique indexes for eliminating DISTINCT should be discouraged.
3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
as well
4. The patch works only at the query level, but that functionality can be
expanded generally to other places which add Unique/HashAggregate/Group
nodes if the underlying relation can be proved to produce distinct rows.
But that's probably more work since we will have to label paths with unique
keys similar to pathkeys.
5. Have you tested this OUTER joins, which can render inner side nullable?

On Thu, Feb 6, 2020 at 11:31 AM Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:

> update the patch with considering the semi/anti join.
>
> Can anyone help to review this patch?
>
> Thanks
>
>
> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
>
>> Hi:
>>
>> I wrote a patch to erase the distinctClause if the result is unique by
>> definition, I find this because a user switch this code from oracle
>> to PG and find the performance is bad due to this, so I adapt pg for
>> this as well.
>>
>> This patch doesn't work for a well-written SQL, but some drawback
>> of a SQL may be not very obvious, since the cost of checking is pretty
>> low as well, so I think it would be ok to add..
>>
>> Please see the patch for details.
>>
>> Thank you.
>>
>

--
--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Chapman Flack 2020-02-07 16:08:48 Does recovery write to backup_label ?
Previous Message Alexey Bashtanov 2020-02-07 15:22:12 Re: improve transparency of bitmap-only heap scans