Re: [PATCH] Erase the distinctClause if the result is unique by definition

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date: 2020-03-13 01:47:10
Message-ID: CAKU4AWpOgpXqw__4mcXgtZm1rQoVEr3G0na2Oh_UCNcHOxdxqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi David:

On Thu, Mar 12, 2020 at 3:51 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Wed, 11 Mar 2020 at 17:23, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> > Now I am convinced that we should maintain UniquePath on RelOptInfo,
> > I would see how to work with "Index Skip Scan" patch.
>
> I've attached a very early proof of concept patch for unique keys.
> The NULL detection stuff is not yet hooked up, so it'll currently do
> the wrong thing for NULLable columns. I've left some code in there
> with my current idea of how to handle that, but I'll need to add more
> code both to look at the catalogue tables to see if there's a NOT NULL
> constraint and also to check for strict quals that filter out NULLs.
>
> Additionally, I've not hooked up the collation checking stuff yet. I
> just wanted to see if it would work ok for non-collatable types first.
>
> I've added a couple of lines to create_distinct_paths() to check if
> the input_rel has the required UniqueKeys to skip doing the DISTINCT.
> It seems to work, but my tests so far are limited to:
>
> create table t1(a int primary key, b int);
> create table t2(a int primary key, b int);
>
> postgres=# -- t2 could duplicate t1, don't remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.b;
> QUERY PLAN
> ----------------------------------
> HashAggregate
> Group Key: t1.a
> -> Hash Join
> Hash Cond: (t2.b = t1.a)
> -> Seq Scan on t2
> -> Hash
> -> Seq Scan on t1
> (7 rows)
>
>
> postgres=# -- neither rel can duplicate the other due to join on PK.
> Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.a;
> QUERY PLAN
> ----------------------------
> Hash Join
> Hash Cond: (t1.a = t2.a)
> -> Seq Scan on t1
> -> Hash
> -> Seq Scan on t2
> (5 rows)
>
>
> postgres=# -- t2.a cannot duplicate t1 and t1.a is unique. Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.b = t2.a;
> QUERY PLAN
> ----------------------------
> Hash Join
> Hash Cond: (t1.b = t2.a)
> -> Seq Scan on t1
> -> Hash
> -> Seq Scan on t2
> (5 rows)
>
>
> postgres=# -- t1.b can duplicate t2.a. Don't remove DISTINCT
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.b = t2.a;
> QUERY PLAN
> ----------------------------------
> HashAggregate
> Group Key: t2.a
> -> Hash Join
> Hash Cond: (t1.b = t2.a)
> -> Seq Scan on t1
> -> Hash
> -> Seq Scan on t2
> (7 rows)
>
>
> postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.a = t2.b;
> QUERY PLAN
> ----------------------------
> Hash Join
> Hash Cond: (t2.b = t1.a)
> -> Seq Scan on t2
> -> Hash
> -> Seq Scan on t1
> (5 rows)
>
> I've also left a bunch of XXX comments for things that I know need more
> thought.
>
> I believe we can propagate the joinrel's unique keys where the patch
> is currently doing it. I understand that in
> populate_joinrel_with_paths() we do things like swapping LEFT JOINs
> for RIGHT JOINs and switch the input rels around, but we do so only
> because it's equivalent, so I don't currently see why we can't take
> the jointype for the SpecialJoinInfo. I need to know that as I'll need
> to ignore pushed down RestrictInfos for outer joins.
>
> I'm posting now as I know I've been mentioning this UniqueKeys idea
> for quite a while and if it's not something that's going to get off
> the ground, then it's better to figure that out now.
>

Thanks for the code! Here is some points from me.

1. for pupulate_baserel_uniquekeys, we need handle the "pk = Const" as
well.
(relation_has_unqiue_for has a similar logic) currently the following
distinct path is still
there.

postgres=# explain select distinct b from t100 where pk = 1;
QUERY PLAN
----------------------------------------------------------------------------------
Unique (cost=8.18..8.19 rows=1 width=4)
-> Sort (cost=8.18..8.19 rows=1 width=4)
Sort Key: b
-> Index Scan using t100_pkey on t100 (cost=0.15..8.17 rows=1
width=4)
Index Cond: (pk = 1)
(5 rows)

I think in this case, we can add both (pk) and (b) as the UniquePaths.
If so we
can get more opportunities to reach our goal.

2. As for the propagate_unique_keys_to_joinrel, we can add 1 more
UniquePath as
(rel1_unique_paths, rel2_unique_paths) if the current rules doesn't apply.
or else the following cases can't be handled.

postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
QUERY PLAN
--------------------------------------------------------------------------------
Unique (cost=772674.11..810981.11 rows=5107600 width=8)
-> Sort (cost=772674.11..785443.11 rows=5107600 width=8)
Sort Key: t100.pk, t101.pk
-> Nested Loop (cost=0.00..63915.85 rows=5107600 width=8)
-> Seq Scan on t100 (cost=0.00..32.60 rows=2260 width=4)
-> Materialize (cost=0.00..43.90 rows=2260 width=4)
-> Seq Scan on t101 (cost=0.00..32.60 rows=2260
width=4)
(7 rows)

But if we add such rule, the unique paths probably become much longer, so
we need
a strategy to tell if the UniquePath is useful for our query, if not, we
can ignore that.
rel->reltarget maybe a good info for such optimization. I think we can
take this into
consideration for pupulate_baserel_uniquekeys as well.

For the non_null info, Tom suggested to add maintain such info RelOptInfo,
I have done that for the not_null_info for basic relation catalog, I think
we can
maintain the same flag for joinrel and the not null info from
find_nonnullable_vars as
well, but I still didn't find a good place to add that so far.

A small question about the following code:

+ if (relation_has_uniquekeys_for(root, input_rel,
get_sortgrouplist_exprs(parse->distinctClause, parse->targetList), false))
+ {
+
+ add_path(distinct_rel, (Path *)cheapest_input_path);
+
+ /* XXX yeah yeah, need to call the hooks etc. */
+
+ /* Now choose the best path(s) */
+ set_cheapest(distinct_rel);
+
+ return distinct_rel;
+ }

Since we don't create new RelOptInfo/Path, do we need to call add_path and
set_cheapest?

Best Regards
Andy Fan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-03-13 02:59:27 Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
Previous Message Kyotaro Horiguchi 2020-03-13 01:11:44 Re: [PATCH] Skip llvm bytecode generation if LLVM is missing