Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, rushabh(dot)lathia(at)gmail(dot)com
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Date: 2020-04-16 02:17:00
Message-ID: CAKU4AWoymsjbm5KDYbsko13GUfG57pX1QyC3Y8sDHyrfoQeyQQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 15, 2020 at 11:00 AM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Wed, 15 Apr 2020 at 12:19, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> >
> >> 2. I think 0002 is overly restrictive in its demands that
> >> parse->hasAggs must be false. We should be able to just use a Group
> >> Aggregate with unsorted input when the input_rel is unique on the
> >> GROUP BY clause. This will save on hashing and sorting. Basically
> >> similar to what we do for when a query contains aggregates without any
> >> GROUP BY.
> >>
> >
> > Yes, that would be a perfect result; the difficulty is that the
> > current aggregate function execution is highly coupled with the Agg
> > node (ExecInitAgg), which is removed in the unique case.
>
> This case here would be slightly different. It would be handled by
> still creating a Group Aggregate path, but just not consider Hash
> Aggregate and not Sort the input to the Group Aggregate path. Perhaps
> that's best done by creating a new flag bit and using it in
> create_grouping_paths() in the location where we set the flags
> variable. If you determine that the input_rel is unique for the GROUP
> BY clause, and that there are aggregate functions, then set a flag,
> e.g GROUPING_INPUT_UNIQUE. Likely there will be a few other flags that
> you can skip setting in that function, for example, there's no need to
> check if the input can sort, so no need for GROUPING_CAN_USE_SORT,
> since you won't need to sort, likewise for GROUPING_CAN_USE_HASH. I'd
> say there also is no need for checking if we can set
> GROUPING_CAN_PARTIAL_AGG (What would be the point in doing partial
> aggregation when there's 1 row per group?) Then down in
> add_paths_to_grouping_rel(), just add a special case before doing any
> other code, such as:
>
> if ((extra->flags & GROUPING_INPUT_UNIQUE) != 0 &&
>     parse->groupClause != NIL)
> {
>     Path *path = input_rel->cheapest_total_path;
>
>     add_path(grouped_rel, (Path *)
>              create_agg_path(root,
>                              grouped_rel,
>                              path,
>                              grouped_rel->reltarget,
>                              AGG_SORTED,
>                              AGGSPLIT_SIMPLE,
>                              parse->groupClause,
>                              havingQual,
>                              agg_costs,
>                              dNumGroups));
>     return;
> }
>
> You may also want to consider the cheapest startup path there too so
> that the LIMIT processing can do something smarter later in planning
> (assuming cheapest_total_path != cheapest_startup_path (which you'd
> need to check for)).
>
> Perhaps it would be better to only set the GROUPING_INPUT_UNIQUE if
> there is a groupClause, then just Assert(parse->groupClause != NIL)
> inside that if.
>
>
Thank you for your detailed explanation. The attached v6 includes
this feature. Here is the data showing the improvement.

Test cases:
create table grp2 (a int primary key, b char(200), c int);
insert into grp2 select i, 'x', i from generate_series(1, 10000000)i;
analyze grp2;
explain analyze select a, sum(c) from grp2 group by a;

w/o this feature:

postgres=# explain analyze select a, sum(c) from grp2 group by a;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.43..712718.44 rows=10000000 width=12) (actual time=0.088..15491.027 rows=10000000 loops=1)
   Group Key: a
   ->  Index Scan using grp2_pkey on grp2  (cost=0.43..562718.44 rows=10000000 width=8) (actual time=0.068..6503.459 rows=10000000 loops=1)
 Planning Time: 0.916 ms
 Execution Time: *16252.397* ms
(5 rows)

Since the order of my data in the heap and the index is exactly the
same, the index scan is much faster here. The following tests the cost
of the *hash* aggregation:

postgres=# set enable_indexscan to off;
SET
postgres=# explain analyze select a, sum(c) from grp2 group by a;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=765531.00..943656.00 rows=10000000 width=12) (actual time=14424.379..30133.171 rows=10000000 loops=1)
   Group Key: a
   Planned Partitions: 128
   Peak Memory Usage: 4153 kB
   Disk Usage: 2265608 kB
   HashAgg Batches: 640
   ->  Seq Scan on grp2  (cost=0.00..403031.00 rows=10000000 width=8) (actual time=0.042..2808.281 rows=10000000 loops=1)
 Planning Time: 0.159 ms
 Execution Time: *31098.804* ms
(9 rows)

With this feature:
explain analyze select a, sum(c) from grp2 group by a;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..553031.57 rows=10000023 width=12) (actual time=0.044..13209.485 rows=10000000 loops=1)
   Group Key: a
   ->  Seq Scan on grp2  (cost=0.00..403031.23 rows=10000023 width=8) (actual time=0.023..4938.171 rows=10000000 loops=1)
 Planning Time: 0.400 ms
 Execution Time: *13749.121* ms
(5 rows)

During the implementation, I also added an AGG_UNIQUE AggStrategy to
record this information in the Agg plan node; this is a simple way to do
it and should be semantically correct.


V6 also includes:
1. Fixed the misleading comment you mentioned above.
2. Fixed a corner case in the relation_has_uniquekeys_for function.

+ /*
+  * For the UniqueKey->onerow case, uniquekey->exprs is empty as well,
+  * so we can't rely on list_is_subset to handle this special case.
+  */
+ if (exprs == NIL)
+     return false;

Best Regards
Andy Fan

Attachment Content-Type Size
v6-0002-Skip-DISTINCT-GROUP-BY-if-input-is-already-unique.patch application/octet-stream 35.9 KB
v6-0001-Introduce-UniqueKeys-to-determine-RelOptInfo-uniq.patch application/octet-stream 64.7 KB
