Re: [PATCH] Erase the distinctClause if the result is unique by definition

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date: 2020-03-13 03:46:27
Message-ID: CAApHDvq87xncP8bmOmQuxXpaiFzxbm2zWAL2ndUyk2YwDuvxMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 13 Mar 2020 at 14:47, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> 1. for pupulate_baserel_uniquekeys, we need handle the "pk = Const" as well.
> (relation_has_unqiue_for has a similar logic) currently the following distinct path is still
> there.

Yeah, I left a comment in propagate_unique_keys_to_joinrel() to
mention that still needs to be done.

> postgres=# explain select distinct b from t100 where pk = 1;
> QUERY PLAN
> ----------------------------------------------------------------------------------
> Unique (cost=8.18..8.19 rows=1 width=4)
> -> Sort (cost=8.18..8.19 rows=1 width=4)
> Sort Key: b
> -> Index Scan using t100_pkey on t100 (cost=0.15..8.17 rows=1 width=4)
> Index Cond: (pk = 1)
> (5 rows)
>
> I think in this case, we can add both (pk) and (b) as the UniquePaths. If so we
> can get more opportunities to reach our goal.

The UniqueKeySet containing "b" could only be added in the
distinct_rel in the upper planner. It must not change the input_rel
for the distinct.

It's likely best to steer clear of calling UniqueKeys UniquePaths as
it might confuse people. The term "path" is used in PostgreSQL as a
lightweight representation containing all the information required to
build a plan node in createplan.c. More details in
src/backend/optimizer/README.

> 2. As for the propagate_unique_keys_to_joinrel, we can add 1 more UniquePath as
> (rel1_unique_paths, rel2_unique_paths) if the current rules doesn't apply.
> or else the following cases can't be handled.
>
> postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
> QUERY PLAN
> --------------------------------------------------------------------------------
> Unique (cost=772674.11..810981.11 rows=5107600 width=8)
> -> Sort (cost=772674.11..785443.11 rows=5107600 width=8)
> Sort Key: t100.pk, t101.pk
> -> Nested Loop (cost=0.00..63915.85 rows=5107600 width=8)
> -> Seq Scan on t100 (cost=0.00..32.60 rows=2260 width=4)
> -> Materialize (cost=0.00..43.90 rows=2260 width=4)
> -> Seq Scan on t101 (cost=0.00..32.60 rows=2260 width=4)
> (7 rows)

I don't really follow what you mean here. It seems to me there's no
way we can skip doing DISTINCT in the case above. If you've just
missed out the join clause and you meant to have "WHERE t100.pk =
t101.pk", then we can likely fix that later with some sort of
functional dependency tracking. Likely we can just add a Relids field
to UniqueKeySet to track which relids are functionally dependant on a
row from the UniqueKeySet's uk_exprs. That might be as simple as just
pull_varnos from the non-matched exprs and checking to ensure the
result is a subset of functionally dependant rels. I'd need to give
that more thought.

Was this a case you had working in your patch?

> But if we add such rule, the unique paths probably become much longer, so we need
> a strategy to tell if the UniquePath is useful for our query, if not, we can ignore that.
> rel->reltarget maybe a good info for such optimization. I think we can take this into
> consideration for pupulate_baserel_uniquekeys as well.

I don't really think the number of unique indexes in a base rel will
really ever get out of hand for legitimate cases.
propagate_unique_keys_to_joinrel is just concatenating baserel
UniqueKeySets to the joinrel. They're not copied, so it's just tagging
pointers onto the end of an array, which is at best a memcpy(), or at
worst a realloc() then memcpy(). That's not so costly.

> For the non_null info, Tom suggested to add maintain such info RelOptInfo,
> I have done that for the not_null_info for basic relation catalog, I think we can
> maintain the same flag for joinrel and the not null info from find_nonnullable_vars as
> well, but I still didn't find a good place to add that so far.

I'd considered just adding a get_notnull() function to lsyscache.c.
Just below get_attname() looks like a good spot. I imagined just
setting the bit in the UniqueKeySet's non_null_keys field
corresponding to the column position from the index. I could see the
benefit of having a field in RelOptInfo if there was some way to
determine the not-null properties of all columns in the table at once,
but there's not, so we're likely best just looking at the ones that
there are unique indexes on.

> A small question about the following code:
>
> + if (relation_has_uniquekeys_for(root, input_rel, get_sortgrouplist_exprs(parse->distinctClause, parse->targetList), false))
> + {
> +
> + add_path(distinct_rel, (Path *)cheapest_input_path);
> +
> + /* XXX yeah yeah, need to call the hooks etc. */
> +
> + /* Now choose the best path(s) */
> + set_cheapest(distinct_rel);
> +
> + return distinct_rel;
> + }
>
> Since we don't create new RelOptInfo/Path, do we need to call add_path and set_cheapest?

The distinct_rel already exists. add_path() is the standard way we
have of adding paths to the rel's pathlist. Why would you want to
bypass that? set_cheapest() is our standard way of looking at the
pathlist and figuring out the least costly one. It's not a very hard
job to do when there's just 1 path. Not sure why you'd want to do it
another way.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jürgen Purtz 2020-03-13 04:18:40 Re: Add A Glossary
Previous Message movead.li@highgo.ca 2020-03-13 03:18:47 Re: Re: A bug when use get_bit() function for a long bytea string