Re: [PATCH] Erase the distinctClause if the result is unique by definition

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date: 2020-03-12 07:51:11
Message-ID: CAApHDvrBXjAvH45dEAZFOk-hzOt1mJC7-fxZ2v49mc5njtA7VQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 11 Mar 2020 at 17:23, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> Now I am convinced that we should maintain UniquePath on RelOptInfo,
> I would see how to work with "Index Skip Scan" patch.

I've attached a very early proof of concept patch for unique keys.
The NULL detection stuff is not yet hooked up, so it'll currently do
the wrong thing for NULLable columns. I've left some code in there
with my current idea of how to handle that, but I'll need to add more
code both to look at the catalogue tables to see if there's a NOT NULL
constraint and also to check for strict quals that filter out NULLs.

Additionally, I've not hooked up the collation checking stuff yet. I
just wanted to see if it would work ok for non-collatable types first.

I've added a couple of lines to create_distinct_paths() to check if
the input_rel has the required UniqueKeys to skip doing the DISTINCT.
It seems to work, but my tests so far are limited to:

create table t1(a int primary key, b int);
create table t2(a int primary key, b int);

postgres=# -- t2 could duplicate t1, don't remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.b;
QUERY PLAN
----------------------------------
HashAggregate
Group Key: t1.a
-> Hash Join
Hash Cond: (t2.b = t1.a)
-> Seq Scan on t2
-> Hash
-> Seq Scan on t1
(7 rows)

postgres=# -- neither rel can duplicate the other due to join on PK.
Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.a = t2.a;
QUERY PLAN
----------------------------
Hash Join
Hash Cond: (t1.a = t2.a)
-> Seq Scan on t1
-> Hash
-> Seq Scan on t2
(5 rows)

postgres=# -- t2.a cannot duplicate t1 and t1.a is unique. Remove DISTINCT
postgres=# explain (costs off) select distinct t1.a from t1 inner join
t2 on t1.b = t2.a;
QUERY PLAN
----------------------------
Hash Join
Hash Cond: (t1.b = t2.a)
-> Seq Scan on t1
-> Hash
-> Seq Scan on t2
(5 rows)

postgres=# -- t1.b can duplicate t2.a. Don't remove DISTINCT
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.b = t2.a;
QUERY PLAN
----------------------------------
HashAggregate
Group Key: t2.a
-> Hash Join
Hash Cond: (t1.b = t2.a)
-> Seq Scan on t1
-> Hash
-> Seq Scan on t2
(7 rows)

postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
postgres=# explain (costs off) select distinct t2.a from t1 inner join
t2 on t1.a = t2.b;
QUERY PLAN
----------------------------
Hash Join
Hash Cond: (t2.b = t1.a)
-> Seq Scan on t2
-> Hash
-> Seq Scan on t1
(5 rows)

I've also left a bunch of XXX comments for things that I know need more thought.

I believe we can propagate the joinrel's unique keys where the patch
is currently doing it. I understand that in
populate_joinrel_with_paths() we do things like swapping LEFT JOINs
for RIGHT JOINs and switch the input rels around, but we do so only
because it's equivalent, so I don't currently see why we can't take
the jointype for the SpecialJoinInfo. I need to know that as I'll need
to ignore pushed down RestrictInfos for outer joins.

I'm posting now as I know I've been mentioning this UniqueKeys idea
for quite a while and if it's not something that's going to get off
the ground, then it's better to figure that out now.

Attachment Content-Type Size
bearly_poc_uniquekeys_v0.patch application/octet-stream 13.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pengzhou Tang 2020-03-12 08:35:15 Additional size of hash table is alway zero for hash aggregates
Previous Message Michael Paquier 2020-03-12 07:49:41 Re: Add an optional timeout clause to isolationtester step.