Re: DISTINCT vs GROUP BY - was Re: is (not) distinct from

From: "Sven R(dot) Kunze" <srkunze(at)mail(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: DISTINCT vs GROUP BY - was Re: is (not) distinct from
Date: 2017-03-03 09:51:16
Message-ID: b36b9115-8e46-16b7-4b4f-06186cda52a2@mail.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 03.03.2017 06:26, George Neuner wrote:
> I know most people here don't pay much - or any - attention to
> SQLServer, however there was an interesting article recently regarding
> significant performance differences between DISTINCT and GROUP BY as
> used to remove duplicates.
>
> https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct

On a similar note, this is also an interesting read about the topic of
distinct vs group by:

https://blogs.oracle.com/developer/entry/counting_with_oracle_is_faster

Interesting is the performance difference between integers and strings
for PostgreSQL which doesn't exist for Oracle.

I also tried rewriting "select distinct" to "select group by" using
PostgreSQL. It didn't help; it was even worse (see appendix).

> I'll get around to doing some testing soon. For now, I am just asking
> if anyone has ever run into something like this?

Yes, my team did. We use Django on a daily basis to generate SQL
queries. In case of model-spanning queries, a lot of joining and
duplications are involved. Distinct is the "generally" accepted way to
remedy the situation but it's actually more like Tom said: distinct is a
band-aid here. UNIONS and SUBSELECTs would be better I guess.

Sven

** Appendix **

>>>># \d docs
Table "public.docs"
Column | Type | Modifiers
--------+---------+---------------------------------------------------
id | integer | not null default nextval('docs_id_seq'::regclass)
meta | jsonb |
Indexes:
"docs_pkey" PRIMARY KEY, btree (id)

>>>># explain analyze select count(distinct meta->>'blood_group') from
docs;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=760497.00..760497.01 rows=1 width=449) (actual
time=37631.727..37631.727 rows=1 loops=1)
-> Seq Scan on docs (cost=0.00..710497.00 rows=10000000 width=449)
(actual time=0.500..3999.417 rows=10000000 loops=1)
Planning time: 0.211 ms
Execution time: 37631.829 ms
(4 rows)

>>>># explain analyze select count(*) from (select meta->>'blood_group'
from docs group by meta->>'blood_group') as x;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4441923.83..4441923.84 rows=1 width=0) (actual
time=41189.472..41189.472 rows=1 loops=1)
-> Group (cost=4241923.83..4316923.83 rows=10000000 width=449)
(actual time=31303.690..41189.455 rows=8 loops=1)
Group Key: ((docs.meta ->> 'blood_group'::text))
-> Sort (cost=4241923.83..4266923.83 rows=10000000
width=449) (actual time=31303.686..40475.227 rows=10000000 loops=1)
Sort Key: ((docs.meta ->> 'blood_group'::text))
Sort Method: external merge Disk: 129328kB
-> Seq Scan on docs (cost=0.00..735497.00 rows=10000000
width=449) (actual time=0.349..6433.691 rows=10000000 loops=1)
Planning time: 2.189 ms
Execution time: 41203.669 ms
(9 rows)

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Geoff Winkless 2017-03-03 10:43:32 Re: ERROR: functions in index expression must be marked IMMUTABLE
Previous Message Achilleas Mantzios 2017-03-03 09:24:11 Re: PG on SSD