Re: [PATCH] Erase the distinctClause if the result is unique by definition

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date: 2020-03-18 02:56:58
Message-ID: CAKU4AWqOORqW900O-+L4L2+0xknsEqpfcs9FF7SeiO9TmpeZOg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi David:

Thanks for your time.

On Wed, Mar 18, 2020 at 9:56 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Mon, 16 Mar 2020 at 06:01, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> >
> > Hi All:
> >
> > I have re-implemented the patch based on David's suggestion/code, Looks
> it
> > works well. The updated patch mainly includes:
> >
> > 1. Maintain the not_null_colno in RelOptInfo, which includes the not
> null from
> > catalog and the not null from vars.
>
> What about non-nullability that we can derive from means other than
> NOT NULL constraints. Where will you track that now that you've
> removed the UniqueKeySet type?
>

I tracked it in 'deconstruct_recurse', just before
the distribute_qual_to_rels call.

+ ListCell *lc;
+ foreach(lc, find_nonnullable_vars(qual))
+ {
+ Var *var = lfirst_node(Var, lc);
+ RelOptInfo *rel =
root->simple_rel_array[var->varno];
+ if (var->varattno > InvalidAttrNumber)
+ rel->not_null_cols =
bms_add_member(rel->not_null_cols, var->varattno);
+ }

> Traditionally we use attno or attnum rather than colno for variable
> names containing attribute numbers
>

Currently I use a list of Var for a UnqiueKey, I guess it is ok?

>
> > 3. postpone the propagate_unique_keys_to_joinrel call to
> populate_joinrel_with_paths
> > since we know jointype at that time. so we can handle the semi/anti
> join specially.
>
> ok, but the join type was known already where I was calling the
> function from. It just wasn't passed to the function.
>
> > 4. Add the rule I suggested above, if both of the 2 relation yields the
> a unique result,
> > the join result will be unique as well. the UK can be ( (rel1_uk1,
> rel1_uk2).. )
>
> I see. So basically you're saying that the joinrel's uniquekeys should
> be the cartesian product of the unique rels from either side of the
> join. I wonder if that's a special case we need to worry about too
> much. Surely it only applies for clauseless joins

Some other cases we may need this as well:). like select m1.pk, m2.pk
from m1, m2
where m1.b = m2.b;

The cartesian product of the unique rels will make the unqiue keys too
long, so I maintain
the UnqiueKeyContext to make it short. The idea is if (UK1) is unique
already, no bother
to add another UK as (UK1, UK2) which is just a superset of it.

>
>
> 5. If the unique key is impossible to be referenced by others, we can
> safely ignore
> > it in order to keep the (join)rel->unqiuekeys short.
>
> You could probably have an equivalent of has_useful_pathkeys() and
> pathkeys_useful_for_ordering()
>
>
Thanks for suggestion, I will do so in the v5-xx.patch.

> > 6. I only consider the not null check/opfamily check for the uniquekey
> which comes
> > from UniqueIndex. I think that should be correct.
> > 7. I defined each uniquekey as List of Expr, so I didn't introduce new
> node type.
>
> Where will you store the collation Oid? I left comments to mention
> that needed to be checked but just didn't wire it up.
>

This is too embarrassed, I am not sure if it is safe to ignore it. I
removed it due to
the following reasons (sorry for that I didn't explain it carefully for the
last email).

1. When we choose if an UK is usable, we have chance to compare the
collation info
for restrictinfo (where uk = 1) or target list (select uk from t) with
the indexinfo's collation,
the targetlist one has more trouble since we need to figure out the default
collation for it.
However relation_has_unique_index_for has the same situation as us, it
ignores it as well.
See comment /* XXX at some point we may need to check collations here too.
*/. It think
if there are some reasons we can ignore that.

2. What we expect from UK is:
a). Where m1.uniquekey = m2.b m2.uk will not be duplicated by this
joinclause. Here
if m1.uk has different collation, it will raise runtime error.
b). Where m1.uniquekey collate 'xxxx' = m2.b. We may can't depends on
the run-time error this time. But if we are sure that *if uk is uk at
default collation is unique,
then (uk collcate 'other-colation') is unique as well**. if so we may safe
ignore it as well.
c). select uniquekey from t / select uniquekey collate 'xxxx' from t.
This have the same
requirement as item b).

3). Looks maintain the collation for our case totally is a big effort,
and user rarely use it, If
my expectation for 2.b is not true, I prefer to detect such case (user use
a different collation),
we can just ignore the UK for that.

But After all, I think this should be an open question for now.

---
At last, I am so grateful for your suggestion/feedback, that's really
heuristic and constructive.
And so thanks Tom's for the quick review and suggest to add a new fields
for RelOptInfo,
without it I don't think I can add a new field to a so important struct.
And also thanks Bapat who
explains the thing more detailed. I'm now writing the code for partition
index stuff, which
is a bit of boring, since every partition may have different unique index.
I am expecting that
I can finish it in the following 2 days, and hope you can have another
round of review again.

Thanks for your feedback!

Best Regards
Andy Fan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-03-18 02:58:54 Re: Small docs bugfix: make it clear what can be used in UPDATE FROM and DELETE USING
Previous Message Michael Paquier 2020-03-18 02:48:37 Re: ALTER tbl rewrite loses CLUSTER ON index