Skip site navigation (1) Skip section navigation (2)

Re: Querying distinct values from a large table

From: Richard Huxton <dev(at)archonet(dot)com>
To: Igor Lobanov <ilobanov(at)swsoft(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Querying distinct values from a large table
Date: 2007-01-30 10:02:33
Message-ID: 45BF17B9.2000502@archonet.com (view raw or flat)
Thread:
Lists: pgsql-performance
Igor Lobanov wrote:
> 
> 
> Richard Huxton wrote:
>>> I have rather large table with about 5 millions of rows and a dozen 
>>> of columns. Let's suppose that columns are named 'a', 'b', 'c' etc. I 
>>> need to query distinct pairs of ('a';'b') from this table.
>  >
>> What version of PostgreSQL is it?
> 
> 8.1.4

Current release is 8.1.6 - probably worth upgrading when you've got 
time. It should be a simple matter of replacing the binaries but do 
check the release notes.

>> How many distinct values are you getting back from your 5 million 
>> rows? If there are too many, an index isn't going to help.
> 
> No more than 10,000.

OK. Should be possible to do something then.

>> Can you share the EXPLAIN ANALYSE output? You might want to try 
>> increasing work_mem for this one query to speed any sorting.
> 
> Real table and colum names are obfuscated because of NDA, sorry.

Fair enough.

> explain analyze select distinct a, b from tbl
> 
> EXPLAIN ANALYZE output is:
> 
>  Unique  (cost=500327.32..525646.88 rows=1848 width=6) (actual 
> time=52719.868..56126.356 rows=5390 loops=1)
>    ->  Sort  (cost=500327.32..508767.17 rows=3375941 width=6) (actual 
> time=52719.865..54919.989 rows=3378864 loops=1)
>          Sort Key: a, b
>          ->  Seq Scan on tbl  (cost=0.00..101216.41 rows=3375941 
> width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
>  Total runtime: 57307.394 ms

Hmm - am I right in thinking (a,b) are the only two columns on this 
table? That means you'll have a lot of rows per page and an index scan 
could end up fetching lots of pages to check the rows are visible. Still 
- I'd expect it to be better than a seq scan.

The first thing to try is to put the index back and run "SET 
enable_seqscan=off" before the explain analyse. That should force it to 
use the index. Then we'll see what costs it's expecting.

>> How often is the table updated? Clustering might buy you some 
>> improvements (but not a huge amount I suspect).
> 
> It is updated once per 3-5 seconds.

OK - forget clustering then.

> And one more thing. I don't know if it helps, but column 'a' can have 
> value from a limited set: 0, 1 or 2. Column 'b' is also an integer 
> (foreign key, actually).

Hmm - might be worth trying distinct on (b,a) with an index that way 
around - that would give you greater selectivity at the top-level of the 
btree. Can you repeat the EXPLAIN ANALYSE with that too please.

-- 
   Richard Huxton
   Archonet Ltd

In response to

Responses

pgsql-performance by date

Next:From: John ParnefjordDate: 2007-01-30 10:05:03
Subject: Re: Tuning
Previous:From: Igor LobanovDate: 2007-01-30 09:36:12
Subject: Re: Querying distinct values from a large table

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group