Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, rushabh(dot)lathia(at)gmail(dot)com, Ashutosh Bapat <ashutosh(dot)bapat(at)2ndquadrant(dot)com>
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Date: 2020-05-07 11:26:27
Message-ID: CAExHW5t_F+SDrZ6HL8mr812zbUX9sySYSqVGOe=fk_sVnzRuZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Andy,
Sorry for delay in review. Your earlier patches are very large and it
requires some time to review those. I didn't finish reviewing those
but here are whatever comments I have till now on the previous set of
patches. Please see if any of those are useful to the new set.

+/*
+ * Return true iff there is an equal member in target for every
+ * member in members

Suggest reword: return true iff every entry in "members" list is also present
in the "target" list. This function doesn't care about multi-sets, so please
mention that in the prologue clearly.

+
+ if (root->parse->hasTargetSRFs)
+ return;

Why? A relation's uniqueness may be useful well before we work on SRFs.

+
+ if (baserel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+ /*
+ * Set UniqueKey on member rel is useless, we have to recompute it at
+ * upper level, see populate_partitionedrel_uniquekeys for reference
+ */
+ return;

Handling these here might help in bottom up approach. We annotate each
partition here and then annotate partitioned table based on the individual
partitions. Same approach can be used for partitioned join produced by
partitionwise join.

+ /*
+ * If the unique index doesn't contain partkey, then it is unique
+ * on this partition only, so it is useless for us.
+ */

Not really. It could help partitionwise join.

+
+ /* Now we have the unique index list which as exactly same on all
childrels,
+ * Set the UniqueIndex just like it is non-partition table
+ */

I think it's better to annotate each partition with whatever unique index it
has whether or not global. That will help partitionwise join, partitionwise
aggregate/group etc.

+ /* A Normal group by without grouping set */
+ if (parse->groupClause)
+ add_uniquekey_from_sortgroups(root,
+ grouprel,
+ root->parse->groupClause);

Those keys which are part of groupClause and also form unique keys in the input
relation, should be recorded as unique keys in group rel. Knowing the minimal
set of keys allows more optimizations.

+
+ foreach(lc, unionrel->reltarget->exprs)
+ {
+ exprs = lappend(exprs, lfirst(lc));
+ colnos = lappend_int(colnos, i);
+ i++;
+ }

This should be only possible when it's not UNION ALL. We should add some assert
or protection for that.

+
+ /* Fast path */
+ if (innerrel->uniquekeys == NIL || outerrel->uniquekeys == NIL)
+ return;
+
+ outer_is_onerow = relation_is_onerow(outerrel);
+ inner_is_onerow = relation_is_onerow(innerrel);
+
+ outerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
outerrel);
+ innerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
innerrel);
+
+ clause_list = gather_mergeable_joinclauses(joinrel, outerrel, innerrel,
+ restrictlist, jointype);

Something similar happens in select_mergejoin_clauses(). At least from the
first reading of this patch, I get an impression that all these functions which
are going through clause lists and index lists should be merged into other
functions which go through these lists hunting for some information useful to
the optimizer.

+
+
+ if (innerrel_keeps_unique(root, outerrel, innerrel, clause_list, false))
+ {
+ foreach(lc, innerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+ if (!list_is_subset(ctx->uniquekey->exprs,
joinrel->reltarget->exprs))
+ {
+ /* The UniqueKey on baserel is not useful on the joinrel */

A joining relation need not be a base rel always, it could be a join rel as
well.

+ ctx->useful = false;
+ continue;
+ }
+ if ((jointype == JOIN_LEFT || jointype == JOIN_FULL) &&
!ctx->uniquekey->multi_nullvals)
+ {
+ /* Change the multi_nullvals to true at this case */

Need a comment explaining this. Generally, I feel, this and other functions in
this file need good comments explaining the logic esp. "why" instead of "what".

+ else if (inner_is_onerow)
+ {
+ /* Since rows in innerrel can't be duplicated AND if
innerrel is onerow,
+ * the join result will be onerow also as well. Note:
onerow implies
+ * multi_nullvals = false.

I don't understand this comment. Why is there only one row in the other
relation which can join to this row?

+ }
+ /*
+ * Calculate max_colno in subquery. In fact we can check this with
+ * list_length(sub_final_rel->reltarget->exprs), However, reltarget
+ * is not set on UPPERREL_FINAL relation, so do it this way
+ */

Should/can we use the same logic to convert an expression in the subquery into
a Var of the outer query as in convert_subquery_pathkeys(). If the subquery
doesn't have a reltarget set, we should be able to use reltarget of any of its
paths since all of those should match the positions across subquery and the
outer query.

Will continue reviewing your new set of patches as time permits.

On Thu, May 7, 2020 at 7:02 AM Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
>
> I just uploaded the v7 version and split it into smaller commits for easier
> review/merge. I also maintain a up-to-date README.uniquekey
> document since something may changed during discussion or later code.
>
> Here is the simple introduction of each commit.
>
> ====
> 1. v7-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
>
> This commit adds the notnullattrs to RelOptInfo, which grabs the information
> from both catalog and user's query.
>
>
> 2. v7-0002-Introuduce-RelOptInfo.uniquekeys-attribute.patch
>
> This commit just add the uniquekeys to RelOptInfo and maintain it at every
> stage. However the upper level code is not changed due to this.
>
> Some changes of this part in v7:
> 1). Removed the UniqueKey.positions attribute. In the past it is used in
> convert_subquery_uniquekeys, however we don't need it actually (And I
> maintained it wrong in the past). Now I build the relationship between the
> outer var to subuqery's TargetList with outrel.subquery.processed_tlist.
> 2). onerow UniqueKey(exprs = NIL) need to be converted to normal uniquekey(exprs
> != NIL) if it is not one-row any more. This may happen on some outer join.
>
>
> 3. v7-0003-Refactor-existing-uniqueness-related-code-to-use-.patch
>
> Refactor the existing functions like innerrel_is_unique/res_is_distinct_for to
> use UniqueKey, and postpone the call of remove_useless_join and
> reduce_unique_semijoins to use the new implementation.
>
> 4. v7-0004-Remove-distinct-node-AggNode-if-the-input-is-uniq.patch
>
> Remove the distinct node if the result is distinct already. Remove the aggnode
> if the group by clause is unique already AND there is no aggregation function in
> query.
>
> 5. v7-0005-If-the-group-by-clause-is-unique-and-we-have-aggr.patch
>
> If the group by clause is unique and query has aggregation function, we use
> the AGG_SORT strategy but without really sort since it has only one row in each
> group.
>
>
> 6. v7-0006-Join-removal-at-run-time-with-UniqueKey.patch
>
> This commit run join removal at build_join_rel. At that time, it can fully uses
> unique key. It can handle some more cases, I added some new test cases to
> join.sql. However it can be a replacement of the current one. There are some
> cases the new strategy can work run well but the current one can. Like
>
> SELECT a.* FROM a LEFT JOIN (b left join c on b.c_id = c.id) ON (a.b_id = b.id);
>
> during the join a & b, the join can't be removed since b.id is still useful in
> future. However in the future, we know the b.id can be removed as well, but
> it is too late to remove the previous join.
>
> At the implementation part, the main idea is if the join_canbe_removed. we
> will copy the pathlist from outerrel to joinrel. There are several items need to
> handle.
>
> 1. To make sure the overall join_search_one_level, we have to keep the joinrel
> even the innerrel is removed (rather than discard the joinrel).
> 2. If the innerrel can be removed, we don't need to build pathlist for joinrel,
> we just reuse the pathlist from outerrel. However there are many places where
> use assert rel->pathlist[*]->parent == rel. so I copied the pathlist, we
> have to change the parent to joinrel.
> 3. During create plan for some path on RTE_RELATION, it needs to know the
> relation Oid with path->parent->relid. so we have to use the outerrel->relid
> to overwrite the joinrel->relid which is 0 before.
> 4. Almost same paths as item 3, it usually assert best_path->parent->rtekind ==
> RTE_RELATION; now the path may appeared in joinrel, so I used
> outerrel->rtekind to overwrite joinrel->rtekind.
> 5. I guess there are some dependencies between path->pathtarget and
> rel->reltarget. since we reuse the pathlist of outerrel, so I used the
> outer->reltarget as well. If the join can be removed, I guess the length of
> list_length(outrel->reltarget->exprs) >= (joinrel->reltarget->exprs). we can
> rely on the ProjectionPath to reduce the tlist.
>
> My patches is based on the current latest commit fb544735f1.
>
> Best Regards
> Andy Fan

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2020-05-07 11:32:13 Re: Postgres Windows build system doesn't work with python installed in Program Files
Previous Message 曾文旌 2020-05-07 11:12:24 Re: [Proposal] Global temporary tables