An improvement on parallel DISTINCT

From:	Richard Guo <guofenglinux(at)gmail(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	An improvement on parallel DISTINCT
Date:	2023-12-26 11:23:02
Message-ID:	CAMbWs48u9VoVOouJsys1qOaC9WVGVmBa+wT1dx8KvxF5GPzezA@mail.gmail.com
Lists:	pgsql-hackers

While reviewing Heikki's Omit-junk-columns patchset[1], I noticed that
root->upper_targets[] is used to set the target for partial_distinct_rel,
which is not great because root->upper_targets[] is not supposed to be
used by the core code. The comment in grouping_planner() says:

* Save the various upper-rel PathTargets we just computed into
* root->upper_targets[]. The core code doesn't use this, but it
* provides a convenient place for extensions to get at the info.

Then while fixing this issue, I noticed an opportunity for improvement
in how we generate Gather/GatherMerge paths for the two-phase DISTINCT.
The Gather/GatherMerge paths are added by generate_gather_paths(), which
does not consider ordering that might be useful above the GatherMerge
node. This can be improved by using generate_useful_gather_paths()
instead. With this change, I see improved query plans in the
regression test "select_distinct.sql". For instance,

-- Test parallel DISTINCT
SET parallel_tuple_cost=0;
SET parallel_setup_cost=0;
SET min_parallel_table_scan_size=0;
SET max_parallel_workers_per_gather=2;

-- Ensure we get a parallel plan
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;

-- on master
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;
                     QUERY PLAN
----------------------------------------------------
 Unique
   ->  Sort
         Sort Key: four
         ->  Gather
               Workers Planned: 2
               ->  HashAggregate
                     Group Key: four
                     ->  Parallel Seq Scan on tenk1
(8 rows)

-- on patched
EXPLAIN (costs off)
SELECT DISTINCT four FROM tenk1;
                     QUERY PLAN
----------------------------------------------------
 Unique
   ->  Gather Merge
         Workers Planned: 2
         ->  Sort
               Sort Key: four
               ->  HashAggregate
                     Group Key: four
                     ->  Parallel Seq Scan on tenk1
(8 rows)

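I believe the second plan is better: the Sort now runs in the workers,
in parallel, and the leader only has to merge already-sorted streams
before the final Unique.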
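
As a rough illustration of the difference between the two plan shapes,
here is a toy model in Python. It is not PostgreSQL code: the function
names (partial_distinct, unique, plan_master, plan_patched) are made up
for this sketch, each inner list stands in for the rows one worker
scans, and costs are not modelled at all. It only shows that both
shapes return the same result, and where the sorting work happens.

```python
import heapq


def partial_distinct(chunk):
    """Per-worker HashAggregate: drop duplicates, output order undefined."""
    return list(set(chunk))


def unique(sorted_rows):
    """Unique node: drop adjacent duplicates from an already-sorted input."""
    out = []
    for row in sorted_rows:
        if not out or out[-1] != row:
            out.append(row)
    return out


def plan_master(chunks):
    """Unique -> Sort -> Gather -> HashAggregate.

    The leader gathers unordered partial results and sorts all of them.
    """
    gathered = [row for chunk in chunks for row in partial_distinct(chunk)]
    return unique(sorted(gathered))


def plan_patched(chunks):
    """Unique -> Gather Merge -> Sort -> HashAggregate.

    Each worker sorts its own partial result; the leader only merges
    the sorted streams, which preserves the ordering.
    """
    sorted_partials = [sorted(partial_distinct(chunk)) for chunk in chunks]
    return unique(heapq.merge(*sorted_partials))


# Two hypothetical workers, each scanning part of the "four" column.
chunks = [[0, 1, 2, 3, 0, 1], [3, 2, 1, 0, 3, 2]]
assert plan_master(chunks) == plan_patched(chunks) == [0, 1, 2, 3]
```

In the toy model the leader in plan_patched never calls sorted(); it
only does a merge, which is the part of the work that moves into the
workers in the patched plan.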

Attached is a patch that includes this change and also eliminates the
usage of root->upper_targets[] in the core code. It also makes some
tweaks to the comment.

Any thoughts?

[1] https://www.postgresql.org/message-id/flat/2ca5865b-4693-40e5-8f78-f3b45d5378fb%40iki.fi

Thanks
Richard

Attachment	Content-Type	Size
v1-0001-Improve-parallel-DISTINCT.patch	application/octet-stream	4.6 KB

Responses

Re: An improvement on parallel DISTINCT at 2024-02-02 03:26:18 from David Rowley
