[PATCH] query rewrite for distinct stage under some cases

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: [PATCH] query rewrite for distinct stage under some cases
Date: 2020-01-20 03:41:49
Message-ID: CAKU4AWrX6t8MEG34wvRvHZwriyaHgGQ-x6ZycKMhSHYTkUuKuw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Hackers:

This is a patch for unique elimination rewrite for distinct query.

it will cost much for a big result set and some times it is not

necessary. The basic idea is the unique node like in the following

can be eliminated.

1. select distinct pk, ... from t;

2. select distinct uk-col1, uk-col2, ...

from t where uk-col1 is not null and uk-col2 is not null;

3. select distinct a, b .... from t group by a, b;

4. select distinct t1.pk, t2.pk, ... from t1, t2.

The distinct keyword in above sql is obviously redundant,

But a SQL may join a lot of tables with tens of columns in target

list and a number of indexes. Finally the sql is hidden in

hundreds of SQL in system, it will be hard to find it out.

That's why I want the kernel can keep watching it,

based on that it will not cost too much. Oracle has similar rewrite

as well.

The rule for single relation is:

a). The primary key is choose in target list.

b). The unique key is choose in the target list,

and we can tell the result of the related column is not nullable.

we can tell it by catalog and qual.

c). The group-by columns is choose in target list.

d). The target list in subquery has a distinct already.

(select distinct xxx from (select distinct xxx from t2));

The rule for multi-relations join is:

e). if any relation yield a unique result, then the result of join will be

unique as well

If an sql matches any rule of above, we can remove the unique node.

Rule d) is not so common and complex to implement, so it is not

included in this patch.

Implementation:

f). I choose the target list per table, if there is hasDistinctOn, the
source

is the target list intersect distinctClause. or else, the source is
target list only.

g). the pk/uk columns information is gathered
by RelationGetIndexAttrBitmap.

a new filed RelationData.plain_uk_ukattrs is added and gathered as
well.

h). As last if any rule matches, Query->distinctClause &
Query->hasDistinctOn

will be cleared to avoid generating the related path.

There are also some fast paths to return earlier:

i). If a table in join-list, but no columns is choose in target list.

j). The join-list contains sub-query. (this rewrite happens after
sub-query pull-up)

k). Based on the cost of the checking, we check group by first and

then PK and then UK + not null.

There is no impact for non-distinct query, as for distinct query, this rule
will

increase the total cost a bit if the distinct can't be removed. The unique

check is most expensive, so here is the data to show the impact, a 4

columns table, no pk, 1 uk with 2 columns.

With this feature disable: avg plan time: 0.095ms

With this feature enabled: avg plan time: 0.102ms

Basically I think the cost would be ok.

Concurrency:

l). When we see a pk or uk index, so we remove the index on another
session,

I think this would be ok because of MVCC rules.

m). When we are creating an index in another session but it is not
completed,

suppose we can't get it with RelationGetIndexAttrBitmap. so it should be
ok

as well.

The behavior can be changed online with enable_unique_elimination,

it is true by default.

The patch is generated with the latest code on github,

and the current HEAD is 34a0a81bfb388504deaa51b16a8bb531b827e519.

The make installcheck-world & check-world has pass.

Test case join.sql and sysview.sql are impacted by this change

and they are expected, the changed expected.out file is included in this
patch.

Please let me know if you have any questions.

Thank you

Attachment Content-Type Size
0001-Avoid-the-distinct-stage-if-the-result-is-unique-alr.patch application/octet-stream 22.9 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2020-01-20 03:51:46 Re: [HACKERS] Block level parallel vacuum
Previous Message Amit Kapila 2020-01-20 03:39:35 Re: [HACKERS] Block level parallel vacuum