Re: [PATCH] Teach planner to further optimize sort in distinct

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Ankit Kumar Pandey <itsankitkp(at)gmail(dot)com>
Cc: pghackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Teach planner to further optimize sort in distinct
Date: 2023-01-20 00:37:52
Message-ID: CAApHDvr_tfwbkcmiPJtM+SNz81KcP0efJKRmu0hEEcKS0BQsXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 20 Jan 2023 at 08:26, Ankit Kumar Pandey <itsankitkp(at)gmail(dot)com> wrote:
> > On 19/01/23 18:49, David Rowley wrote:
> > You can just switch to using that function in
> > create_final_distinct_paths(). You'll need to consider if the query is
> > a DISTINCT ON query and not try the unordered version of the function
> > in that case.
>
> Tried this, it worked as expected. Tests are green as well.

Looking at the patch, you've not added any additional tests. If the
existing tests are all passing then that just tells me that either the
code is not functioning as intended or we have no tests that look at
the EXPLAIN output which can make use of this optimization. If the
former is true, then the patch needs to be fixed. If it's the latter
then you need to write new tests.

> > It's possible that truncate_useless_pathkeys() needs to be modified to
> > check if the pathkeys might be useful for DISTINCT
>
> We have pathkeys_useful_for_merging and pathkeys_useful_for_ordering.
>
> Can we not have pathkeys_useful_for_distinct?

I don't know all the repercussions. If you look at add_path() then
you'll see we do a pathkey comparison when the costs are not fuzzily
different from an existing path so that we try to keep a path with the
best pathkeys. If we start keeping paths around with other weird
pathkeys that are not along the lines of the query_pathkeys requires,
then add_path might start throwing away paths that are actually good
for something. It seems probable that could cause some regressions.

> I have attached patch with pathkeys_count_contained_in_unordered
> and corresponding changes in create_final_distinct_paths for reference.

Does this patch actually work? I tried:

create table ab (a int, b int);
insert into ab select a,b from generate_Series(1,1000)
a,generate_series(1,1000) b;
analyze ab;
create index on ab(a);
set enable_hashagg=0;
explain select distinct b,a from ab where a < 10;
QUERY PLAN
------------------------------------------------------------------------------------
Unique (cost=729.70..789.67 rows=7714 width=8)
-> Sort (cost=729.70..749.69 rows=7996 width=8)
Sort Key: b, a
-> Index Scan using ab_a_idx on ab (cost=0.42..211.36
rows=7996 width=8)
Index Cond: (a < 10)
(5 rows)

I'd have expected an incremental sort here. I don't see that you're
adjusting IndexPath's pathkeys anywhere. The nested loop in
pathkeys_count_contained_in_unordered() seems to be inside out. The
reordered_pathkeys must have the common pathkeys in the order they
appear in keys2. In your patch, they'll be ordered according to the
keys1 list, which is wrong. Testing would tell you this, so all the
more reason to make the patch work and write some queries to ensure it
does actually work, then some tests to ensure that remains in a
working state.

Feel free to take the proper time to write a working patch which
contains new tests to ensure it's functioning as intended. It's
disheartening to review patches that don't seem to work. If it wasn't
meant to work, then you didn't make that clear. I'll likely not be
able to do any further reviews until the March commitfest, so it might
be better to only post if you're stuck. Please don't rush out the
next patch. Take your time and study the code and see if you can build
up your own picture for what the repercussions might be of IndexPaths
having additional pathkeys when they were previously empty. If you're
uncertain of aspects of the patch you've written feel free to leave
XXX type comments to indicate this. That way the reviewer will know
you might need more guidance on that and you'll not forget yourself
when you come back and look again after a few weeks.

David

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2023-01-20 00:40:55 Re: refactoring relation extension and BufferAlloc(), faster COPY
Previous Message Greg Stark 2023-01-20 00:37:48 Re: Unicode grapheme clusters