Slow Count-Distinct Query

From: Christopher Jackson <crjackso(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Slow Count-Distinct Query
Date: 2014-03-30 19:45:51
Message-ID: CAN81C19-crDaZF3QysefWqL8qNSPZnJu2HY-ys4C-Szpd7LBvw@mail.gmail.com
Lists: pgsql-performance

Hi all,

tl;dr - How can I speed up my count-distinct query?

I apologize in advance if this question has been asked already; I'm
finding the mailing list hard to navigate. I'm trying to speed up a query
that counts distinct emails in a table, using Postgres 9.3.3. The name of
the table is participants. Our domain is set up such that duplicate emails
are allowed so long as a particular corresponding column value is unique.

TABLE participants

  id serial NOT NULL (primary key)
  email character varying(255)
  (other columns omitted)

I have the following index defined:

CREATE INDEX index_participants_on_email ON participants
  USING btree (email COLLATE pg_catalog."default");

The query I'm trying to run is select count(distinct email) from
participants. I've also tried the group by equivalent. On a table of
2 million rows, the query takes about 1 minute to return, which is way
too long. Running EXPLAIN ANALYZE shows that the index is being ignored
and a full sequential scan of the table is performed.

So, I tried running the following after dropping the index:
create index email_idx on participants(email) where email=email;
set enable_bitmapscan = false;
set seq_page_cost = 0.1;
set random_page_cost = 0.2;
create index email_idx_2 on participants(email);
cluster participants using email_idx_2;
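For what it's worth, the SET commands above only affect my current session; a sketch of undoing them afterwards (assuming the stock defaults of seq_page_cost = 1.0, random_page_cost = 4.0, enable_bitmapscan = on):

```sql
-- RESET restores each parameter to the value from postgresql.conf,
-- so these experiments don't leak into other sessions or persist.
RESET enable_bitmapscan;
RESET seq_page_cost;
RESET random_page_cost;
```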

With these settings in place, if I run select count(distinct email) from
participants I get:

"Aggregate (cost=29586.20..29586.21 rows=1 width=18) (actual
time=54243.643..54243.644 rows=1 loops=1)"
"  -> Seq Scan on participants (cost=0.00..24586.18 rows=2000008
width=18) (actual time=0.030..550.296 rows=2000008 loops=1)"
"Total runtime: 54243.669 ms"

When I run the following, I get MUCH better results:

select count(1) from (select email from participants where email=email
group by email) x;

"Aggregate (cost=1856.36..1856.37 rows=1 width=0) (actual
time=1393.573..1393.573 rows=1 loops=1)"
"  Output: count(1)"
"  -> Group (cost=0.43..1731.36 rows=10000 width=18) (actual
time=0.052..1205.977 rows=2000008 loops=1)"
"        Output: participants.email"
"        -> Index Only Scan using email_idx on public.participants
(cost=0.43..1706.36 rows=10000 width=18) (actual time=0.050..625.248
rows=2000008 loops=1)"
"              Output: participants.email"
"              Heap Fetches: 2000008"
"Total runtime: 1393.599 ms"

This query has a weird where clause (email=email) because I'm trying to
force the planner's hand into using the index.
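A possible alternative to the email=email trick, sketched here based on the "loose indexscan" technique described on the PostgreSQL wiki (I haven't benchmarked it on this table), is a recursive CTE that uses the btree index to jump from one distinct email to the next:

```sql
-- Sketch: emulate a loose index scan (skip scan) with a recursive CTE.
-- Each iteration fetches the smallest email greater than the previous
-- one via the index, so work is proportional to the number of distinct
-- values rather than the number of rows.
WITH RECURSIVE distinct_emails AS (
    (SELECT email FROM participants ORDER BY email LIMIT 1)
  UNION ALL
    SELECT (SELECT p.email
            FROM participants p
            WHERE p.email > d.email
            ORDER BY p.email
            LIMIT 1)
    FROM distinct_emails d
    WHERE d.email IS NOT NULL
)
SELECT count(email) FROM distinct_emails;  -- count() skips the final NULL
```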

I'm concerned about changing the enable_bitmapscan and page-cost settings
because I'm not yet sure what the consequences are. Can anyone enlighten
me on the recommended way to speed up this query?

Thanks
