Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, rushabh(dot)lathia(at)gmail(dot)com
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Date: 2020-04-03 02:17:11
Message-ID: CAKU4AWpoA2qCS=rS-7Qa38Fp3fwjwc9Gn9PgSfjC2V5uzBsDPg@mail.gmail.com
Lists: pgsql-hackers

The updated patch should fix all the issues. See the comments below for more
information.

On Tue, Mar 31, 2020 at 9:44 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Sun, 29 Mar 2020 at 20:50, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> > Some other changes made in the new patch:
> > 1. Fixed bug for UniqueKey calculation for OUTER join.
> > 2. Fixed some typo error in comments.
> > 3. Renamed the field "grantee" as "guarantee".
>
> I've had a look over this patch. Thank for you doing further work on it.
>
> I've noted down the following during my read of the code:
>
> 1. There seem to be some cases where joins are no longer being
> detected as unique. This is evident in postgres_fdw.out. We shouldn't
> be regressing any of these cases.
>

The issue here is that I didn't distinguish the one-row case in the UniqueKey
struct. When an outer relation is joined with a relation which has only one
row, at most one row can match each outer row. The only-one-row cases in
postgres_fdw.out come from aggregate calls.

Added a new field "onerow" to the UniqueKey struct, and optimized the onerow
UniqueKey so it does not record every expr. See add_uniquekey_for_onerow
and relation_is_onerow.
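To illustrate the idea, here is a rough Python model (illustrative names only, not the actual PostgreSQL structs): a relation known to return at most one row is unique on *any* set of expressions, so a single "onerow" marker can stand in for one UniqueKey per targetlist entry.

```python
# Hypothetical model of the "onerow" optimization: instead of recording
# every expr of a one-row relation as a UniqueKey, record one marker.

class UniqueKey:
    def __init__(self, exprs=None, onerow=False):
        self.exprs = frozenset(exprs or ())
        self.onerow = onerow

def relation_is_onerow(uniquekeys):
    # Mirrors the idea of relation_is_onerow(): a single marker key.
    return len(uniquekeys) == 1 and uniquekeys[0].onerow

def exprs_are_unique(uniquekeys, exprs):
    # A set of exprs is unique if it covers some recorded UniqueKey;
    # the onerow marker is covered by every expr set.
    return any(uk.onerow or uk.exprs <= set(exprs) for uk in uniquekeys)

onerow_rel = [UniqueKey(onerow=True)]
normal_rel = [UniqueKey(exprs={"id"})]

assert relation_is_onerow(onerow_rel)
assert exprs_are_unique(onerow_rel, {"a", "b"})   # any column set works
assert exprs_are_unique(normal_rel, {"id", "b"})  # superset of a key
assert not exprs_are_unique(normal_rel, {"b"})
```

This also matches the observation in point 10 below that an empty expr set is covered by every candidate set.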

> 2. The following change does not seem like it should be part of this
> patch. I understand you perhaps have done as you think it will
> improve the performance of checking if an expression is in a list of
> expressions.
>
> - COMPARE_SCALAR_FIELD(varno);
> + /* Compare varattno first since it has higher selectivity than varno */
> COMPARE_SCALAR_FIELD(varattno);
> + COMPARE_SCALAR_FIELD(varno);
>
> If you think that is true, then please do it as a separate effort and
> provide benchmarks with your findings.
>
Rolled back.

> 3. list_all_members_in. I think this would be better named as
> list_is_subset. Please follow the lead of bms_is_subset().
> Additionally, you should Assert that IsPointerList is true as there's
> nothing else to indicate that it can't be used for an int or Oid list.
>
Done
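For reference, a minimal model of the intended semantics (a Python sketch; the real function works on PostgreSQL pointer Lists and should Assert IsPointerList as noted above):

```python
# Model of list_is_subset(), following the lead of bms_is_subset():
# true iff every member of `sub` also appears in `super_`.
def list_is_subset(sub, super_):
    return all(item in super_ for item in sub)

assert list_is_subset([], ["a"])                   # empty set is a subset of anything
assert list_is_subset(["a"], ["a", "b"])
assert not list_is_subset(["a", "c"], ["a", "b"])
```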

> 4. guarantee is not a very good name for the field in UniqueKey.
> Maybe something like is_not_null?
>
I tried is_not_null, but when is_not_null is false it reads as a double
negation, which didn't feel right to me. In the end I used multi_nullvals to
indicate that the UniqueKey may yield multiple null values, so the uniqueness
is not guaranteed.
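To illustrate why the flag is needed (a Python model of SQL NULL semantics, not planner code): a unique index treats NULLs as distinct from one another, so a nullable unique key can admit several NULL rows, while DISTINCT would group them together.

```python
# A nullable "unique" column satisfies the unique constraint yet can
# still hold duplicate rows from DISTINCT's point of view.

rows = [(1,), (2,), (None,), (None,)]  # passes a unique constraint

def violates_unique_constraint(rows):
    # Unique indexes only compare non-NULL values with each other.
    non_null = [r for r in rows if r[0] is not None]
    return len(non_null) != len(set(non_null))

def distinct_count(rows):
    # DISTINCT treats all NULLs as one group.
    return len(set(rows))

assert not violates_unique_constraint(rows)  # the index is satisfied
assert distinct_count(rows) == 3             # but the 4 rows are not all distinct
```

This is why remove_useless_join and reduce_unique_semijoins can use such a key (they only match non-NULL join values) while DISTINCT removal cannot.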

> 5. I think you should be performing a bms_del_member during join
> removal rather than removing this Assert()
>
> - Assert(bms_equal(rel->relids, root->all_baserels));

Done

>
> FWIW, it's far from perfect that you've needed to delay the left join
> removal, but I do understand why you've done it. It's also far from
> perfect that you're including removed relations in the
> total_table_pages calculation. c6e4133fae1 took some measures to
> improve this calculation and this is making it worse again.
>
> 6. Can you explain why you moved the populate_baserel_uniquekeys()
> call out of set_plain_rel_size()?
>
> 7. I don't think the removal of rel_supports_distinctness() is
> warranted. Is it not ok to check if the relation has any uniquekeys?
> It's possible, particularly in join_is_removable that this can save
> quite a large amount of effort.
>
Done

> 8. Your spelling of unique is incorrect in many places:
>
> src/backend/nodes/makefuncs.c: * makeUnqiueKey
> src/backend/optimizer/path/uniquekeys.c:static List
> *initililze_unqiuecontext_for_joinrel(RelOptInfo *joinrel,
> src/backend/optimizer/path/uniquekeys.c: * check if combination of
> unqiuekeys from both side is still useful for us,
> src/backend/optimizer/path/uniquekeys.c: outerrel_uniquekey_ctx
> = initililze_unqiuecontext_for_joinrel(joinrel, outerrel);
> src/backend/optimizer/path/uniquekeys.c: innerrel_uniquekey_ctx
> = initililze_unqiuecontext_for_joinrel(joinrel, innerrel);
> src/backend/optimizer/path/uniquekeys.c: * we need to convert the
> UnqiueKey from sub_final_rel to currel via the positions info in
> src/backend/optimizer/path/uniquekeys.c: ctx->pos =
> pos; /* the position in current targetlist, will be used to set
> UnqiueKey */
> src/backend/optimizer/path/uniquekeys.c: * Check if Unqiue key of the
> innerrel is valid after join. innerrel's UniqueKey
> src/backend/optimizer/path/uniquekeys.c: *
> initililze_unqiuecontext_for_joinrel
> src/backend/optimizer/path/uniquekeys.c: * all the unqiuekeys which
> are not possible to use later
>
> src/backend/optimizer/path/uniquekeys.c:initililze_unqiuecontext_for_joinrel(RelOptInfo
> *joinrel, RelOptInfo *inputrel)
> src/backend/optimizer/plan/analyzejoins.c: /*
> This UnqiueKey is what we want */
> src/backend/optimizer/plan/planner.c: /* If we the result if unqiue
> already, we just return the input_rel directly */
> src/include/nodes/pathnodes.h: * exprs is a list of exprs which is
> unqiue on current RelOptInfo.
> src/test/regress/expected/join.out:-- XXXX: since b.id is unqiue now
> so the group by cluase is erased, so
> src/test/regress/expected/select_distinct.out:-- create unqiue index on
> dist_p
> src/test/regress/expected/select_distinct.out:-- we also support
> create unqiue index on each child tables
> src/test/regress/sql/join.sql:-- XXXX: since b.id is unqiue now so the
> group by cluase is erased, so
> src/test/regress/sql/select_distinct.sql:-- create unqiue index on dist_p
> src/test/regress/sql/select_distinct.sql:-- we also support create
> unqiue index on each child tables
>
> 9. A few things wrong with the following fragment:
>
> /* set the not null info now */
> ListCell *lc;
> foreach(lc, find_nonnullable_vars(qual))
> {
> Var *var = lfirst_node(Var, lc);
> RelOptInfo *rel = root->simple_rel_array[var->varno];
> if (var->varattno > InvalidAttrNumber)
> rel->not_null_cols = bms_add_member(rel->not_null_cols, var->varattno);
> }
>
> a. including a function call in the foreach macro is not a practise
> that we really follow. It's true that the macro now assigns the 2nd
> param to a variable. Previous to 1cff1b95ab6 this was not the case and
> it's likely best not to leave any bad examples around that code which
> might get backported might follow.
> b. We generally subtract InvalidAttrNumber from varattno when
> including in a Bitmapset.
> c. not_null_cols is not well named. I think notnullattrs
> d. not_null_cols should not be a Relids type, it should be Bitmapset.
>
Above 2 items done.

> 10. add_uniquekey_for_onerow() seems pretty wasteful. Is there really
> a need to add each item in the rel's targetlist to the uniquekey list?
> What if we just add an empty list to the unique keys, that way if we
> need to test if some expr is a superset of any uniquekey, then we'll
> see it is as any set is a superset of an empty set. Likely the empty
> set of uniquekeys should be the only one in the rel's uniquekey list.
>
>
Now I use a single UniqueKey to represent this situation. See
add_uniquekey_for_onerow and relation_is_onerow.

> 11. In create_distinct_paths() the code is now calling
> get_sortgrouplist_exprs() multiple times with the same input. I think
> it would be better to just call it once and set the result in a local
> variable.
>
Done

> 12. The comment in the code below is not true. The List contains
> Lists, of which contain UniqueKeys
>
> List *uniquekeys; /* List of UniqueKey */
>
> 13. I'm having trouble parsing the final sentence in:
>
> + * can only guarantee the uniqueness without considering the null values.
> This
> + * field is necessary for remove_useless_join & reduce_unique_semijions
> since
> + * these cases don't care about the null values.
>
> Why is the field which stores the nullability of the key required for
> code that does not care about the nullability of the key?
>
> Also please check your spelling of the word "join"
>
>
Actually I didn't find the spelling error for "join"..

> 14. In the following fragment, instead of using i+1, please assign the
> FormData_pg_attribute to a variable named attr and use attr->attnum.
> Also, please see what I mentioned above about subtracting
> InvalidAttrNumber
>
> + rel->not_null_cols = bms_add_member(rel->not_null_cols, i+1);
>
Done

> 15. The tests you've changed the expected outcome of in join.out
> should be updated so that the GROUP BY and DISTINCT clause is not
> removed. This will allow the test to continue testing what it was
> intended to test. You can do this by changing the columns in the GROUP
> BY clause so that the new code does not find uniquekeys for those
> columns.
>
Done

> 16. The tests in aggregates.out are in a similar situation. There are
> various tests trying to ensure that remove_useless_groupby_columns()
> does what it's meant to do. You can modify these tests to add a join
> which is non-unique to effectively duplicate the PK column.
>
>
There are some exceptions in this part.
1. The tests for remove_useless_groupby_columns overlap somewhat with our
current logic for erasing the Group node, like the tests on a single relation,
so I only modified 2 test cases for this purpose.
2. While reading the code in remove_useless_groupby_columns, I found a new
case for our UniqueKey:
select * from m1 where a > (select avg(b) from m2 group by *M1.A*);
Here m1.a has var->varlevelsup > 0, so how should we set the UniqueKey for
this grouprel? I added some incomplete checks in the
add_uniquekey_from_sortgroups function, but I'm not sure if we need them.
3. remove_useless_groupby_columns maintains parse->constraintDeps when it
depends on a primary key, but UniqueKey doesn't maintain such data, since we
have a translation layer which should protect us from the concurrency and
isolation issues. Do we need to do that in UniqueKey as well?

> 17. In your select_distinct tests, can you move away from naming the
> tables starting with select_distinct? It makes reading queries pretty
> hard.
>
> e.g. explain (costs off) select distinct uk1, uk2 from
> select_distinct_a where uk2 is not null;
>
> When I first glanced that, I failed to see the underscores and the
> query looked invalid.

>
> 18. Check the spelling of "erased". You have it spelt as "ereased" in
> a couple of locations.
>
> 19. Please pay attention to the capitalisation of SQL keywords in the
> test files you've modified. I understand we're very inconsistent in
> this department in general, but we do at least try not to mix
> capitalisation within the same file. Basically, please upper case the
> keywords in select_distinct.sql
>
> 20. In addition to the above, please try to wrap long SQL lines so
> they're below 80 chars.
>
All of the above 4 items done.

> I'll review the patch in more detail once the above points have been
> addressed.

> David

Attachment Content-Type Size
v4-0001-Maintain-UniqueKey-at-each-RelOptInfo-this-inform.patch application/octet-stream 86.5 KB
