GROUP BY vs DISTINCT

From: Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>
To: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: GROUP BY vs DISTINCT
Date: 2006-12-20 07:19:39
Message-ID: 20061220071940.17382.qmail@web52303.mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have a question about the following. The table has an index on (clicked at time zone 'PST'). I am using postgres 8.1.3

Actually, I think I answered my own question already. But I want to confirm - Is the GROUP BY faster because it doesn't have to sort results, whereas DISTINCT must produce sorted results? This wasn't clear to me from the documentation. If it's true, then I could save considerable time by using GROUP BY where I have been using DISTINCT in the past. Usually I simply want a count of the distinct values, and there is no need to sort for that.

I'm also interested in the count(distinct) case at the bottom. The cost estimate seems similar to the GROUP BY, but the actual cost is much higher.

The table is insert-only and was analyzed before running these queries. The domain column being aggregated has around 16k distinct values, and there are 780k rows in total (for the entire table, not the slice being selected in these queries).

Thanks,
Brian

live:parking=> explain analyze SELECT domain
FROM parked_redirects
WHERE (clicked at time zone 'PST') >= '2006-12-17'
AND (clicked at time zone 'PST')
< '2006-12-18'::timestamp without time zone + '1 day'::interval
GROUP BY domain;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=19078.50..19085.29 rows=679 width=18) (actual time=709.309..717.096 rows=14526 loops=1)
-> Index Scan using parked_redirects_pst on parked_redirects (cost=0.01..17846.82 rows=492672 width=18) (actual time=0.073..406.510 rows=504972 loops=1)
Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND (timezone('PST'::text, clicked) < '2006-12-19 00:00:00'::timestamp without time zone))
Total runtime: 719.810 ms
(4 rows)

live:parking=> explain analyze SELECT DISTINCT domain
FROM parked_redirects
WHERE (clicked at time zone 'PST') >= '2006-12-17'
AND (clicked at time zone 'PST')
< '2006-12-18'::timestamp without time zone + '1 day'::interval;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=64433.98..66897.52 rows=679 width=18) (actual time=15329.904..15647.849 rows=14526 loops=1)
-> Sort (cost=64433.98..65665.75 rows=492709 width=18) (actual time=15329.901..15511.479 rows=504972 loops=1)
Sort Key: "domain"
-> Index Scan using parked_redirects_pst on parked_redirects (cost=0.01..17847.41 rows=492709 width=18) (actual time=0.068..519.696 rows=504972 loops=1)
Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND (timezone('PST'::text, clicked) < '2006-12-19 00:00:00'::timestamp without time zone))
Total runtime: 15666.863 ms
(6 rows)

live:parking=> explain analyze SELECT count(DISTINCT domain)
FROM parked_redirects
WHERE (clicked at time zone 'PST') >= '2006-12-17'
AND (clicked at time zone 'PST')
< '2006-12-18'::timestamp without time zone + '1 day'::interval;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=19107.20..19107.21 rows=1 width=18) (actual time=11380.530..11380.531 rows=1 loops=1)
-> Index Scan using parked_redirects_pst on parked_redirects (cost=0.01..17873.67 rows=493412 width=18) (actual time=0.022..347.473 rows=504972 loops=1)
Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND (timezone('PST'::text, clicked) < '2006-12-19 00:00:00'::timestamp without time zone))
Total runtime: 11384.923 ms
(4 rows)

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Ragnar 2006-12-20 09:01:28 Re: max_fsm_pages and check_points
Previous Message ALVARO ARCILA 2006-12-20 05:31:21 max_fsm_pages and check_points