Re: [PATCH] Erase the distinctClause if the result is unique by definition

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date: 2020-03-02 17:24:57
Message-ID: CAKU4AWpjMb1tt+qR8E7aA-09thrqUcJ6SYsp4hJwVMAf+NWSpQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thank you Tom for the review!

On Mon, Mar 2, 2020 at 4:46 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> writes:
> > Please see if you have any comments. Thanks
>
> The cfbot isn't at all happy with this. Its linux build is complaining
> about a possibly-uninitialized variable, and then giving up:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/656722993
>
> The Windows build isn't using -Werror, but it is crashing in at least
> two different spots in the regression tests:
>
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.81778
>
> I've not attempted to identify the cause of that.
>
>
Before I submit the patch, I can make sure "make check-world" is
successful, but
since the compile option is not same, so I didn't catch
the possibly-uninitialized
variable. As for the crash on the windows, I didn't get the enough
information
now, I will find a windows server and reproduce the cases.

I just found the link http://commitfest.cputube.org/ this morning, I will
make sure
the next patch can go pass this test.

At a high level, I'm a bit disturbed that this focuses only on DISTINCT
> and doesn't (appear to) have any equivalent intelligence for GROUP BY,
> though surely that offers much the same opportunities for optimization.
> It seems like it'd be worthwhile to take a couple steps back and see
> if we couldn't recast the logic to work for both.
>
>
OK, Looks group by is a bit harder than distinct since the aggregation
function.
I will go through the code to see why to add this logic.

> Some other random comments:
>
> * Don't especially like the way you broke query_is_distinct_for()
> into two functions, especially when you then introduced a whole
> lot of other code in between.

This is not expected by me until you point it out. In this case, I have to
break the query_is_distinct_for to two functions, but it true that we
should put the two functions together.

That's just making reviewer's job
> harder to see what changed. It makes the comments a bit disjointed
> too, that is where you even had any. (Zero introductory comment
> for query_is_distinct_agg is *not* up to project coding standards.
> There are a lot of other undercommented places in this patch, too.)
>
> * Definitely don't like having query_distinct_through_join re-open
> all the relations. The data needed for this should get absorbed
> while plancat.c has the relations open at the beginning. (Non-nullness
> of columns, in particular, seems like it'll be useful for other
> purposes; I'm a bit surprised the planner isn't using that already.)
>

I can add new attributes to RelOptInfo and fill the value in
get_relation_info
call.

> * In general, query_distinct_through_join seems hugely messy, expensive,
> and not clearly correct. If it is correct, the existing comments sure
> aren't enough to explain what it is doing or why.

>
Removing the relation_open call can make it a bit simpler, I will try more
comment to make it clearer in the following patch.

> * There seem to be some pointless #include additions, eg in planner.c
> the added code doesn't look to justify any of them. Please also
> avoid unnecessary whitespace changes, those also slow down reviewing.
>
>
That may because I added the header file some time 1 and then refactored
the code later then forget the remove the header file accordingly. Do we
need
to relay on experience to tell if the header file is needed or not, or do
have have
any code to tell it automatically?

> * I see you decided to add a new regression test file select_distinct_2.
> That's a poor choice of name because it conflicts with our rules for the
> naming of alternative output files. Besides which, you forgot to plug
> it into the test schedule files, so it isn't actually getting run.
> Is there a reason not to just add the new test cases to select_distinct?
>
>
Adding it to select_distinct.sql is ok for me as well. Actually I have no
obviously reason to add the new file.

> * There are some changes in existing regression cases that aren't
> visibly related to the stated purpose of the patch, eg it now
> notices that "select distinct max(unique2) from tenk1" doesn't
> require an explicit DISTINCT step. That's not wrong, but I wonder
> if maybe you should subdivide this patch into more than one patch,
> because that must be coming from some separate change. I'm also
> wondering what caused the plan change in expected/join.out.
>

Per my purpose it should be in the same patch, the logical here is we
have distinct in the sql and the query is distinct already since the max
function (the rule is defined in query_is_distinct_agg which is splited
from
the original query_is_distinct_for clause).

> regards, tom lane
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexey Kondratov 2020-03-02 17:59:49 Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
Previous Message Mark Dilger 2020-03-02 17:17:29 Re: Portal->commandTag as an enum