BUG #16887: Group by is faster than distinct

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: liuxy(at)gatech(dot)edu
Subject: BUG #16887: Group by is faster than distinct
Date: 2021-02-22 21:20:23
Message-ID: 16887-976a88d2f4f87d45@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 16887
Logged by: XINYU LIU
Email address: liuxy(at)gatech(dot)edu
PostgreSQL version: 12.3
Operating system: Ubuntu 20.04
Description:

Hello,

We have 2 TPC-H queries which fetch the same tuples but have significant
query execution time differences (3 times).

We are sharing a pair of TPC-H queries that exhibit this performance
difference:

First query:
SELECT DISTINCT d0.lineitem_l_shipmode
FROM (
SELECT lineitem.l_shipmode AS lineitem_l_shipmode
FROM lineitem
WHERE lineitem.l_shipinstruct IS distinct
FROM 'COLLECT COD ') AS d0
WHERE d0.lineitem_l_shipmode IS DISTINCT
FROM 'MAIL'

Second query:
SELECT d0.lineitem_l_shipmode
FROM (
SELECT lineitem.l_shipmode AS lineitem_l_shipmode
FROM lineitem
WHERE lineitem.l_shipinstruct IS distinct
FROM 'COLLECT COD ') AS d0
WHERE d0.lineitem_l_shipmode IS DISTINCT
FROM 'MAIL'
GROUP BY d0.lineitem_l_shipmode

Actual Behavior
We executed both queries on the TPC-H benchmark of scale factor 5: the first
query takes over 20 seconds, while the second query only takes 6.5 seconds.
We think the time difference results from different plans selected.
Specifically, in the first (slow) query, the optimizer decides to not
parallelize the SCAN and GROUP operations.

Query Execution Plan
First query:
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1060784.91..1060784.98 rows=7 width=11) (actual
time=20233.319..20233.320 rows=6 loops=1)
Group Key: lineitem.l_shipmode
-> Seq Scan on lineitem (cost=0.00..1012391.94 rows=19357187 width=11)
(actual time=0.074..13586.400 rows=19288709 loops=1)
Filter: ((l_shipinstruct IS DISTINCT FROM 'COLLECT COD
'::bpchar) AND (l_shipmode IS DISTINCT FROM 'MAIL'::bpchar))
Rows Removed by Filter: 10711086
Planning Time: 0.133 ms
Execution Time: 20233.374 ms

Second query:
QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------
Group (cost=771057.65..771059.32 rows=7 width=11) (actual
time=6583.724..6583.735 rows=6 loops=1)
Group Key: lineitem.l_shipmode
-> Gather Merge (cost=771057.65..771059.29 rows=14 width=11) (actual
time=6583.722..6585.134 rows=18 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=770057.63..770057.65 rows=7 width=11) (actual
time=6579.986..6579.986 rows=6 loops=3)
Sort Key: lineitem.l_shipmode
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Partial HashAggregate (cost=770057.46..770057.53 rows=7
width=11) (actual time=6579.945..6579.946 rows=6 loops=3)
Group Key: lineitem.l_shipmode
-> Parallel Seq Scan on lineitem
(cost=0.00..749893.72 rows=8065495 width=11) (actual time=0.058..4510.171
rows=6429570 loops=3)
Filter: ((l_shipinstruct IS DISTINCT FROM
'COLLECT COD '::bpchar) AND (l_shipmode IS DISTINCT FROM
'MAIL'::bpchar))
Rows Removed by Filter: 3570362
Planning Time: 0.159 ms
Execution Time: 6585.196 ms

Expected Behavior
Since these two queries are semantically equivalent, we were hoping that
PostgreSQL will evaluate them in roughly the same amount of time.

Test Environment
Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23
00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
PostgreSQL v12.3
Database: TPC-H benchmark (with scale factor 5)
The description of table lineitem is as follows:
tpch5=# \d+ lineitem;
Table "public.lineitem"
Column | Type | Collation | Nullable | Default |
Storage | Stats target | Description
-----------------+-----------------------+-----------+----------+---------+----------+--------------+-------------
l_orderkey | integer | | not null | |
plain | |
l_partkey | integer | | not null | |
plain | |
l_suppkey | integer | | not null | |
plain | |
l_linenumber | integer | | not null | |
plain | |
l_quantity | numeric(15,2) | | not null | |
main | |
l_extendedprice | numeric(15,2) | | not null | |
main | |
l_discount | numeric(15,2) | | not null | |
main | |
l_tax | numeric(15,2) | | not null | |
main | |
l_returnflag | character(1) | | not null | |
extended | |
l_linestatus | character(1) | | not null | |
extended | |
l_shipdate | date | | not null | |
plain | |
l_commitdate | date | | not null | |
plain | |
l_receiptdate | date | | not null | |
plain | |
l_shipinstruct | character(25) | | not null | |
extended | |
l_shipmode | character(10) | | not null | |
extended | |
l_comment | character varying(44) | | not null | |
extended | |
Indexes:
"lineitem_pkey" PRIMARY KEY, btree (l_orderkey, l_linenumber)

Here are the steps for reproducing our observations:

1. Download the dataset from the link:
https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
2. Set up TPC-H benchmark
tar xzvf tpch5_postgresql.tar.gz
cd tpch5_postgresql
db=tpch5
createdb $db
psql -d $db < dss.ddl
for i in `ls *.tbl`
do
echo $i
name=`echo $i|cut -d'.' -f1`
psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING
'LATIN1';"
done
psql -d $db < dss_postgres.ri
3. Execute the queries

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Santosh Udupi 2021-02-23 01:08:17 pg_restore - generated column - not populating
Previous Message PG Bug reporting form 2021-02-22 20:54:40 BUG #16886: string equivalence predicate causes system slowdown