Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, rushabh(dot)lathia(at)gmail(dot)com, Ashutosh Bapat <ashutosh(dot)bapat(at)2ndquadrant(dot)com>
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Date: 2020-05-08 01:57:24
Message-ID: CAKU4AWpBXBoRzz3_CnUu-rg5E1kR8z1MCQQqs1gTmODj3-mUAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Ashutosh:

Appreciate for your comments!

On Thu, May 7, 2020 at 7:26 PM Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
wrote:

> Hi Andy,
> Sorry for delay in review.

I understand no one has obligation to do that, and it must take
reviewer's time
and more, so really appreciated for it! Hope I can provide more kindly
help like
this in future as well.

> Your earlier patches are very large and it
> requires some time to review those. I didn't finish reviewing those
> but here are whatever comments I have till now on the previous set of
> patches. Please see if any of those are useful to the new set.
>
> Yes, I just realized the size as well, so I split them to smaller commit
and
each commit and be build and run make check successfully.

All of your comments still valid except the last one
(covert_subquery_uniquekeys)
which has been fixed v7 version.

>
> +/*
> + * Return true iff there is an equal member in target for every
> + * member in members
>
> Suggest reword: return true iff every entry in "members" list is also
> present
> in the "target" list.

Will do, thanks!

> This function doesn't care about multi-sets, so please
> mention that in the prologue clearly.
>
> +
> + if (root->parse->hasTargetSRFs)
> + return;
>
> Why? A relation's uniqueness may be useful well before we work on SRFs.
>
>
Looks I misunderstand when the SRF function is executed. I will fix this
in v8.

+
> + if (baserel->reloptkind == RELOPT_OTHER_MEMBER_REL)
> + /*
> + * Set UniqueKey on member rel is useless, we have to recompute
> it at
> + * upper level, see populate_partitionedrel_uniquekeys for
> reference
> + */
> + return;
>
> Handling these here might help in bottom up approach. We annotate each
> partition here and then annotate partitioned table based on the individual
> partitions. Same approach can be used for partitioned join produced by
> partitionwise join.
>
> + /*
> + * If the unique index doesn't contain partkey, then it is unique
> + * on this partition only, so it is useless for us.
> + */
>
> Not really. It could help partitionwise join.
>
> +
> + /* Now we have the unique index list which as exactly same on all
> childrels,
> + * Set the UniqueIndex just like it is non-partition table
> + */
>
> I think it's better to annotate each partition with whatever unique index
> it
> has whether or not global. That will help partitionwise join, partitionwise
> aggregate/group etc.
>

Excellent catch! All the 3 items above is partitionwise join related, I
need some time
to check how to handle that.

>
> + /* A Normal group by without grouping set */
> + if (parse->groupClause)
> + add_uniquekey_from_sortgroups(root,
> + grouprel,
> + root->parse->groupClause);
>
> Those keys which are part of groupClause and also form unique keys in the
> input
> relation, should be recorded as unique keys in group rel. Knowing the
> minimal
> set of keys allows more optimizations.
>

This is a very valid point now. I ignored this because I wanted to remove
the AggNode
totally if the part of groupClause is unique, However it doesn't happen
later if there is
aggregation call in this query.

> +
> + foreach(lc, unionrel->reltarget->exprs)
> + {
> + exprs = lappend(exprs, lfirst(lc));
> + colnos = lappend_int(colnos, i);
> + i++;
> + }
>
> This should be only possible when it's not UNION ALL. We should add some
> assert
> or protection for that.
>

OK, actually I called this function in generate_union_paths. which handle
UNION case only. I will add the Assert anyway.

>
> +
> + /* Fast path */
> + if (innerrel->uniquekeys == NIL || outerrel->uniquekeys == NIL)
> + return;
> +
> + outer_is_onerow = relation_is_onerow(outerrel);
> + inner_is_onerow = relation_is_onerow(innerrel);
> +
> + outerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
> outerrel);
> + innerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
> innerrel);
> +
> + clause_list = gather_mergeable_joinclauses(joinrel, outerrel,
> innerrel,
> + restrictlist, jointype);
>
> Something similar happens in select_mergejoin_clauses().

I didn't realized this before. I will refactor this code accordingly if
necessary
after reading that.

> At least from the
> first reading of this patch, I get an impression that all these functions
> which
> are going through clause lists and index lists should be merged into other
> functions which go through these lists hunting for some information useful
> to
> the optimizer.
>
+
> +
> + if (innerrel_keeps_unique(root, outerrel, innerrel, clause_list,
> false))
> + {
> + foreach(lc, innerrel_ukey_ctx)
> + {
> + UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
> + if (!list_is_subset(ctx->uniquekey->exprs,
> joinrel->reltarget->exprs))
> + {
> + /* The UniqueKey on baserel is not useful on the joinrel
> */
>
> A joining relation need not be a base rel always, it could be a join rel as
> well.
>

good catch.

>
> + ctx->useful = false;
> + continue;
> + }
> + if ((jointype == JOIN_LEFT || jointype == JOIN_FULL) &&
> !ctx->uniquekey->multi_nullvals)
> + {
> + /* Change the multi_nullvals to true at this case */
>
> Need a comment explaining this. Generally, I feel, this and other
> functions in
> this file need good comments explaining the logic esp. "why" instead of
> "what".

Exactly.

>
>
+ else if (inner_is_onerow)
> + {
> + /* Since rows in innerrel can't be duplicated AND if
> innerrel is onerow,
> + * the join result will be onerow also as well. Note:
> onerow implies
> + * multi_nullvals = false.
>
> I don't understand this comment. Why is there only one row in the other
> relation which can join to this row?
>

I guess you may miss the onerow special case if I understand correctly.
inner_is_onerow means something like "SELECT xx FROM t1 where uk = 1".
innerrel can't be duplicated means: t1.y = t2.pk; so the finally result
is onerow
as well. One of the overall query is SELECT .. FROM t1, t2 where t2.y =
t2.pk;

I explained more about onerow in the v7 README.unqiuekey document, just
copy
it here.

===
1. What is UniqueKey?
....
onerow is also a kind of UniqueKey which means the RelOptInfo will have 1
row at
most. it has a stronger semantic than others. like SELECT uk FROM t; uk is
normal unique key and may have different values.
SELECT colx FROM t WHERE uk = const. colx is unique AND we have only 1
value. This
field can used for innerrel_is_unique. and also be used as an optimization
for
this case. We don't need to maintain multi UniqueKey, we just maintain one
with
onerow = true and exprs = NIL.

onerow is set to true only for 2 cases right now. 1) SELECT .. FROM t WHERE
uk =
1; 2). SELECT aggref(xx) from t; // Without group by.
===

===
2. How is it maintained?
....
More considerations about onerow:
1. If relation with one row and it can't be duplicated, it is still possible
contains mulit_nullvas after outer join.
2. If the either UniqueKey can be duplicated after join, the can get one row
only when both side is one row AND there is no outer join.
3. Whenever the onerow UniqueKey is not a valid any more, we need to
convert one
row UniqueKey to normal unique key since we don't store exprs for one-row
relation. get_exprs_from_uniquekeys will be used here.
===

and 3. in the v7 implementation, the code struct is more clearer:)

>
> + }
> + /*
> + * Calculate max_colno in subquery. In fact we can check this with
> + * list_length(sub_final_rel->reltarget->exprs), However, reltarget
> + * is not set on UPPERREL_FINAL relation, so do it this way
> + */
>
>
> Should/can we use the same logic to convert an expression in the subquery
> into
> a Var of the outer query as in convert_subquery_pathkeys().

Yes, my previous implementation is actually wrong. and should be fixed it
in v7.

> If the subquery doesn't have a reltarget set, we should be able to use
> reltarget

of any of its paths since all of those should match the positions across
> subquery

and the outer query.
>

but I think it should be rel->subroot->processed_tlist rather than
reltarget? Actually I still
a bit of uneasy about rel->subroot->processed_tlist for some DML case,
which the
processed_tlist is different and I still not figure out its impact.

> Will continue reviewing your new set of patches as time permits.
>

Thank you! Actually there is no big difference between v6 and v7 regarding
the
UniqueKey part except 2 bug fix. However v7 has some more documents,
comments improvement and code refactor/split, which may be helpful
for review. You may try v7 next time if v8 has not come yet:)

Best Regards
Andy Fan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2020-05-08 02:31:42 Re: +(pg_lsn, int8) and -(pg_lsn, int8) operators
Previous Message Alvaro Herrera 2020-05-08 01:47:34 Re: Logical replication subscription owner