BUG #16635: Query Optimizer - Performance optimization for the UNION function

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: XINYULIU(at)UMICH(dot)EDU
Subject: BUG #16635: Query Optimizer - Performance optimization for the UNION function
Date: 2020-09-25 03:39:16
Message-ID: 16635-b06e1ff6be5a1975@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16635
Logged by: XINYU LIU
Email address: XINYULIU(at)UMICH(dot)EDU
PostgreSQL version: 13.0
Operating system: Ubuntu 20.04
Description:

Hello,

We are sharing a pair of TPC-H queries that show a potential performance
optimization in this report:

First query:
SELECT "o_orderstatus",
"o_clerk",
"o_comment"
FROM "orders"
WHERE "o_orderstatus" LIKE '%'
|| 'F'
UNION
SELECT "o_orderstatus",
"o_clerk",
"o_comment"
FROM "orders";

Second query:
SELECT "o_orderstatus",
"o_clerk",
"o_comment"
FROM (SELECT "o_orderstatus",
"o_clerk",
"o_comment"
FROM "orders"
WHERE "o_orderstatus" LIKE '%'
|| 'F'
UNION ALL
SELECT "o_orderstatus",
"o_clerk",
"o_comment"
FROM "orders") AS "t6"
GROUP BY "o_orderstatus",
"o_clerk",
"o_comment";

[Actual Behavior]
We executed both queries on the TPC-H benchmark of scale factor 5: the first
query takes 40 seconds, while the second query only takes 15 seconds. We
think the time difference results from different plans selected.

[Query Execution Plan]
First query:
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=5807645.69..5919521.71 rows=11187602 width=248) (actual
time=32713.029..40265.378 rows=7499932 loops=1)
-> Sort (cost=5807645.69..5835614.69 rows=11187602 width=248) (actual
time=32713.028..38766.718 rows=11154501 loops=1)
Sort Key: orders.o_orderstatus, orders.o_clerk, orders.o_comment
Sort Method: external merge Disk: 854080kB
-> Append (cost=0.00..597462.95 rows=11187602 width=248) (actual
time=0.380..2995.901 rows=11154501 loops=1)
-> Seq Scan on orders (cost=0.00..224198.29 rows=3688539
width=67) (actual time=0.379..1420.758 rows=3654501 loops=1)
Filter: (o_orderstatus ~~ '%F'::text)
Rows Removed by Filter: 3845499
-> Seq Scan on orders orders_1 (cost=0.00..205450.63
rows=7499063 width=67) (actual time=0.016..984.347 rows=7500000 loops=1)
Planning Time: 5.560 ms
Execution Time: 40517.865 ms
(11 rows)

Second query:
QUERY
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=1210086.17..1247494.70 rows=1118760 width=67)
(actual time=10487.829..14780.165 rows=7499932 loops=1)
Group Key: orders_1.o_orderstatus, orders_1.o_clerk, orders_1.o_comment
Planned Partitions: 64 Batches: 697 Memory Usage: 4625kB Disk Usage:
984304kB
-> Gather (cost=773639.83..1088421.02 rows=2237520 width=67) (actual
time=3937.253..6917.972 rows=11153666 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=772639.83..863669.02 rows=1118760
width=67) (actual time=3932.997..5636.808 rows=3717889 loops=3)
Group Key: orders_1.o_orderstatus, orders_1.o_clerk,
orders_1.o_comment
Planned Partitions: 64 Batches: 425 Memory Usage: 4185kB
Disk Usage: 323584kB
Worker 0: Batches: 393 Memory Usage: 4185kB Disk Usage:
323584kB
Worker 1: Batches: 453 Memory Usage: 4185kB Disk Usage:
370688kB
-> Parallel Append (cost=0.00..402174.85 rows=6813149
width=67) (actual time=0.139..2475.761 rows=3718167 loops=3)
-> Parallel Seq Scan on orders orders_1
(cost=0.00..169517.62 rows=1536891 width=67) (actual time=0.014..769.725
rows=1218167 loops=3)
Filter: (o_orderstatus ~~ '%F'::text)
Rows Removed by Filter: 1281833
-> Parallel Seq Scan on orders (cost=0.00..161706.10
rows=3124610 width=67) (actual time=0.202..2193.033 rows=3750000 loops=2)
Planning Time: 5.025 ms
Execution Time: 15067.809 ms
(18 rows)

[Expected Behavior]
Since these two queries are semantically equivalent, we want to know whether
PostgreSQL would consider adding such optimization given the potential of
performance improvement.

[Test Environment]
Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23
00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
PostgreSQL v12.3
Database: TPC-H benchmark (with scale factor 5)
Output of the “\d orders”:

tpch5=# \d orders
Table "public.orders"
Column | Type | Collation | Nullable | Default
-----------------+-----------------------+-----------+----------+---------
o_orderkey | integer | | not null |
o_custkey | integer | | not null |
o_orderstatus | character(1) | | not null |
o_totalprice | numeric(15,2) | | not null |
o_orderdate | date | | not null |
o_orderpriority | character(15) | | not null |
o_clerk | character(15) | | not null |
o_shippriority | integer | | not null |
o_comment | character varying(79) | | not null |
Indexes:
"orders_pkey" PRIMARY KEY, btree (o_orderkey)
Foreign-key constraints:
"orders_fk1" FOREIGN KEY (o_custkey) REFERENCES customer(c_custkey)
Referenced by:
TABLE "lineitem" CONSTRAINT "lineitem_fk1" FOREIGN KEY (l_orderkey)
REFERENCES orders(o_orderkey)

[Steps for reproducing our observations:]
Download the dataset from the link:
https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
Set up TPC-H benchmark
tar xzvf tpch5_postgresql.tar.gz
cd tpch5_postgresql
db=tpch5
createdb $db
psql -d $db < dss.ddl
for i in `ls *.tbl`
do
echo $i
name=`echo $i|cut -d'.' -f1`
psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING
'LATIN1';"
done
psql -d $db < dss_postgres.ri
Execute the queries

Browse pgsql-bugs by date

  From Date Subject
Next Message Michał Albrycht 2020-09-25 07:55:57 Re: BUG #16634: Conflicting names of indexes for partitioned tables
Previous Message Xinyu Liu 2020-09-25 02:43:55 Re: BUG #16624: Query Optimizer - Performance bug related to predicate simplification