Unnecessary DISTINCT while primary key in SQL

From: 刘瑞 <whx20202(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Unnecessary DISTINCT while primary key in SQL
Date: 2017-11-04 15:20:42
Message-ID: CAAMf6aTFHErNd-6ZyBH8jzqbNBwgwzFN5HPTJpK9FMJAxvmupw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi all:

I am new in pgsql, I am even new in using mailing list. I send this email
just to give a suggestion on performance.
I found that if primary key or a colume which has an unique index, is in a
select sql,the distinct sometimes is Unnecessary.
Actually, the SQL with DISTINCT will runs more slowly than the SQL without
DISTINCT.

My english is poor, here is the SQL:

CREATE TABLE test_tbl ( k INT PRIMARY KEY, col text)
INSERT into test_tbl select generate_series(1,10000000), 'test';

SQL with DISTINCT:
test=# explain analyze select distinct col, k from test_tbl order by k
limit 1000;
QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1277683.22..1277690.72 rows=1000 width=36) (actual
time=12697.994..12698.382 rows=1000 loops=1)
-> Unique (cost=1277683.22..1329170.61 rows=6864985 width=36) (actual
time=12697.992..12698.311 rows=1000 loops=1)
-> Sort (cost=1277683.22..1294845.68 rows=6864985 width=36)
(actual time=12697.991..12698.107 rows=1000 loops=1)
Sort Key: k, col
Sort Method: external sort Disk: 215064kB
-> Seq Scan on test_tbl (cost=0.00..122704.85 rows=6864985
width=36) (actual time=0.809..7561.215 rows=10000000 loops=1)
Planning time: 2.368 ms
Execution time: 12728.471 ms
(8 rows)

SQL without DISTINCT:
test=# explain analyze select col, k from test_tbl order by k limit 1000;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..31.81 rows=1000 width=9) (actual time=0.667..1.811
rows=1000 loops=1)
-> Index Scan using test_tbl_pkey on test_tbl (cost=0.43..313745.06
rows=10000175 width=9) (actual time=0.666..1.744 rows=1000 loops=1)
Planning time: 0.676 ms
Execution time: 1.872 ms
(4 rows)

Also, after reading "Understanding+How+PostgreSQL+Executes+a+Query", I
guess this happened:
1. the planner see distinct
2. the planner wants to use unique
3. the planner wants to use sort or hash
4. the planner see order by , and the "order by colume" is k, which is in
the "select colume" -- col, k
5. the planner use sort

In fact, the k is primary key, so not only k is distinct, but also any
value combined with k.
And even more, we have a "default primary key index", so we do not need to
sort either.

So my question is :
1. Is my guess correct ?
2. Can we make the planner more clever, so that it can just ignore the
DISTINCT in the cases which just like mine, and use the index ?

Other infomations:
PostgreSQL 9.6.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit

Best regards.
Rui Liu

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message David Rowley 2017-11-06 00:34:25 Re: Unnecessary DISTINCT while primary key in SQL
Previous Message Dave Nicponski 2017-11-03 15:56:45 Re: Re: OLAP/reporting queries fall into nested loops over seq scans or other horrible planner choices