Re: [PATCH] Erase the distinctClause if the result is unique by definition

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date: 2020-02-10 16:22:40
Message-ID: CAExHW5t7ALZmaN8gL5DZV+en5G=4QTbKSYhBrXnSrKgCYNr_AA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Feb 8, 2020 at 12:53 PM Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:

> Hi Ashutosh:
> Thanks for your time.
>
> On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat <
> ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
>> Hi Andy,
>> What might help is to add more description to your email message like
>> giving examples to explain your idea.
>>
>> Anyway, I looked at the testcases you added for examples.
>> +create table select_distinct_a(a int, b char(20), c char(20) not null,
>> d int, e int, primary key(a, b));
>> +set enable_mergejoin to off;
>> +set enable_hashjoin to off;
>> +-- no node for distinct.
>> +explain (costs off) select distinct * from select_distinct_a;
>> + QUERY PLAN
>> +-------------------------------
>> + Seq Scan on select_distinct_a
>> +(1 row)
>>
>> From this example, it seems that the distinct operation can be dropped
>> because (a, b) is a primary key. Is my understanding correct?
>>
>
> Yes, you are correct. Actually I added then to commit message,
> but it's true that I should have copied them in this email body as
> well. so copy it now.
>
> [PATCH] Erase the distinctClause if the result is unique by
> definition
>

I forgot to mention this in the last round of comments. Your patch was
actually removing distictClause from the Query structure. Please avoid
doing that. If you remove it, you are also removing the evidence that this
Query had a DISTINCT clause in it.

>
>
> However the patch as presented has some problems
> 1. What happens if the primary key constraint or NOT NULL constraint gets
> dropped between a prepare and execute? The plan will no more be valid and
> thus execution may produce non-distinct results.
>
> Will this still be an issue if user use doesn't use a "read uncommitted"
> isolation level? I suppose it should be ok for this case. But even though
> I should add an isolation level check for this. Just added that in the
> patch
> to continue discussing of this issue.
>

In PostgreSQL there's no "read uncommitted". But that doesn't matter since
a query can be prepared outside a transaction and executed within one or
more subsequent transactions.

>
>
>> PostgreSQL has similar concept of allowing non-grouping expression as
>> part of targetlist when those expressions can be proved to be functionally
>> dependent on the GROUP BY clause. See check_functional_grouping() and its
>> caller. I think, DISTINCT elimination should work on similar lines.
>>
> 2. For the same reason described in check_functional_grouping(), using
>> unique indexes for eliminating DISTINCT should be discouraged.
>>
>
> I checked the comments of check_functional_grouping, the reason is
>
> * Currently we only check to see if the rel has a primary key that is a
> * subset of the grouping_columns. We could also use plain unique
> constraints
> * if all their columns are known not null, but there's a problem: we need
> * to be able to represent the not-null-ness as part of the constraints
> added
> * to *constraintDeps. FIXME whenever not-null constraints get represented
> * in pg_constraint.
>
> Actually I am doubtful the reason for pg_constraint since we still be able
> to get the not null information from
> relation->rd_attr->attrs[n].attnotnull which
> is just what this patch did.
>

The problem isn't whether not-null-less can be inferred or not, the problem
is whether that can be guaranteed across planning and execution of query
(prepare and execute for example.) The constraintDep machinary registers
the constraints used for preparing plan and invalidates the plan if any of
those constraints change after plan is created.

>
> 3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
>> as well
>>
>
> This is a good point. The rules may have some different for join, so I
> prefer
> to to focus on the current one so far.
>

I doubt that since DISTINCT is ultimately carried out as Grouping
operation. But anyway, I won't hang upon that.

>
>
>> 4. The patch works only at the query level, but that functionality can be
>> expanded generally to other places which add Unique/HashAggregate/Group
>> nodes if the underlying relation can be proved to produce distinct rows.
>> But that's probably more work since we will have to label paths with unique
>> keys similar to pathkeys.
>>
>
> Do you mean adding some information into PlannerInfo, and when we create
> a node for Unique/HashAggregate/Group, we can just create a dummy node?
>

Not so much as PlannerInfo but something on lines of PathKey. See PathKey
structure and related code. What I envision is PathKey class is also
annotated with the information whether that PathKey implies uniqueness.
E.g. a PathKey derived from a Primary index would imply uniqueness also. A
PathKey derived from say Group operation also implies uniqueness. Then just
by looking at the underlying Path we would be able to say whether we need
Group/Unique node on top of it or not. I think that would make it much
wider usecase and a very useful optimization.

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jonathan S. Katz 2020-02-10 16:46:23 2020-02-13 Press Release Draft
Previous Message Dmitry Igrishin 2020-02-10 16:18:19 Re: Is it memory leak or not?