Re: An improvement on parallel DISTINCT

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: An improvement on parallel DISTINCT
Date: 2024-02-02 03:26:18
Message-ID: CAApHDvqzba-Xy8YPdErVGfGhrV2YLCdrGoQafpuF7Sy8dOHVfQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, 27 Dec 2023 at 00:23, Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> -- on master
> EXPLAIN (costs off)
> SELECT DISTINCT four FROM tenk1;
>                    QUERY PLAN
> ----------------------------------------------------
>  Unique
>    ->  Sort
>          Sort Key: four
>          ->  Gather
>                Workers Planned: 2
>                ->  HashAggregate
>                      Group Key: four
>                      ->  Parallel Seq Scan on tenk1
> (8 rows)
>
> -- on patched
> EXPLAIN (costs off)
> SELECT DISTINCT four FROM tenk1;
>                    QUERY PLAN
> ----------------------------------------------------
>  Unique
>    ->  Gather Merge
>          Workers Planned: 2
>          ->  Sort
>                Sort Key: four
>                ->  HashAggregate
>                      Group Key: four
>                      ->  Parallel Seq Scan on tenk1
> (8 rows)
>
> I believe the second plan is better.

I wonder if this change is worthwhile. The sort is only required at
all because the planner opted to HashAggregate in phase1, which
outputs its rows unordered. If phase1 had been done with a sorted
Group Aggregate instead, no sorting would be needed. The only reason
the planner didn't Hash Aggregate for phase2 is the order in which we
generate the distinct paths, plus STD_FUZZ_FACTOR.
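
One way to see that (a sketch, using the regression-test tenk1
table; disabling hash aggregation is just a blunt way to force the
sort-based phase1):

-- With hash aggregation disabled, phase1 falls back to a sort-based
-- path whose output order survives the Gather Merge, so no further
-- Sort is needed on top.
SET enable_hashagg = 0;
EXPLAIN (costs off) SELECT DISTINCT four FROM tenk1;
RESET enable_hashagg;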

Look at the costs of the above plan:

Unique (cost=397.24..397.28 rows=4 width=4)

If I set enable_sort=0, then I get a cheaper plan:

HashAggregate (cost=397.14..397.18 rows=4 width=4)
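
Reproducing that is just the following (a sketch; exact costs will
vary with statistics and cost settings):

SET enable_sort = 0;
EXPLAIN SELECT DISTINCT four FROM tenk1; -- now picks HashAggregate
RESET enable_sort;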

If we add more rows, the cost of sorting will grow faster than the
cost of hash aggregate due to the O(N log2 N) term in our sort
costing.
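
Roughly, cost_sort() charges about 2.0 * cpu_operator_cost per
comparison over N * log2(N) comparisons. With the default
cpu_operator_cost of 0.0025, that's 0.005 * 10000 * log2(10000),
about 664, at 10k rows, and 0.005 * 100000 * log2(100000), about
8305, at 100k rows: a 12.5x increase for 10x the rows, whereas the
hash aggregate's per-row cost grows only linearly.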

If I drop the index on tenk1(hundred), I only need to go up to the
"hundred" column for phase2 to switch to Hash Aggregate. This is
because, with that many distinct groups, the Group Aggregate and Hash
Aggregate paths come out more than STD_FUZZ_FACTOR apart in cost.
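
(A sketch, assuming the regression database, where that index is
named tenk1_hundred:)

BEGIN;
DROP INDEX tenk1_hundred;
EXPLAIN (costs off) SELECT DISTINCT hundred FROM tenk1;
ROLLBACK;
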
Adjusting STD_FUZZ_FACTOR as below means Hash Aggregate is used for
both phases.

-#define STD_FUZZ_FACTOR 1.01
+#define STD_FUZZ_FACTOR 1.0000001
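
(STD_FUZZ_FACTOR is defined in src/backend/optimizer/util/pathnode.c;
add_path() uses it when fuzzily comparing the costs of competing
paths.)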

In light of this, do you still think it's worthwhile making this change?

As I see it, all it's going to result in is extra planner work
without any performance gains.

> Attached is a patch that includes this change and also eliminates the
> usage of root->upper_targets[] in the core code. It also makes some
> tweaks for the comment.

We should fix that. We can consider it independently of the other
change you're proposing.

David
