Re: Querying distinct values from a large table

From: imad <immaad(at)gmail(dot)com>
To: "Richard Huxton" <dev(at)archonet(dot)com>
Cc: "Igor Lobanov" <ilobanov(at)swsoft(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Querying distinct values from a large table
Date: 2007-01-30 11:39:20
Message-ID: 1f30b80c0701300339t1fee83b4x138a7d139874e9e7@mail.gmail.com
Lists: pgsql-performance

To be sure about the performance of an index scan, try forcing the
planner to use one instead of a seq scan. One way is to add a dummy
WHERE clause to the query, i.e. a condition on an indexed column that
holds true for all rows.
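
For example, a minimal sketch, assuming an index on (a, b) exists and,
as noted further down the thread, 'a' is always 0, 1 or 2:

  -- the dummy predicate is true for every row, but it names an indexed
  -- column, which gives the planner a reason to consider the index
  EXPLAIN ANALYZE SELECT DISTINCT a, b FROM tbl WHERE a >= 0;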

--Imad
www.EnterpriseDB.com

On 1/30/07, Richard Huxton <dev(at)archonet(dot)com> wrote:
> Igor Lobanov wrote:
> >
> >
> > Richard Huxton wrote:
> >>> I have a rather large table with about 5 million rows and a dozen
> >>> columns. Let's suppose the columns are named 'a', 'b', 'c', etc. I
> >>> need to query distinct pairs of ('a';'b') from this table.
> >
> >> What version of PostgreSQL is it?
> >
> > 8.1.4
>
> Current release is 8.1.6 - probably worth upgrading when you've got
> time. It should be a simple matter of replacing the binaries but do
> check the release notes.
>
> >> How many distinct values are you getting back from your 5 million
> >> rows? If there are too many, an index isn't going to help.
> >
> > No more than 10,000.
>
> OK. Should be possible to do something then.
>
> >> Can you share the EXPLAIN ANALYSE output? You might want to try
> >> increasing work_mem for this one query to speed any sorting.
> >
> > Real table and column names are obfuscated because of NDA, sorry.
>
> Fair enough.
>
> > explain analyze select distinct a, b from tbl
> >
> > EXPLAIN ANALYZE output is:
> >
> > Unique (cost=500327.32..525646.88 rows=1848 width=6) (actual
> > time=52719.868..56126.356 rows=5390 loops=1)
> > -> Sort (cost=500327.32..508767.17 rows=3375941 width=6) (actual
> > time=52719.865..54919.989 rows=3378864 loops=1)
> > Sort Key: a, b
> > -> Seq Scan on tbl (cost=0.00..101216.41 rows=3375941
> > width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
> > Total runtime: 57307.394 ms
>
> Hmm - am I right in thinking (a,b) are the only two columns on this
> table? That means you'll have a lot of rows per page and an index scan
> could end up fetching lots of pages to check the rows are visible. Still
> - I'd expect it to be better than a seq scan.
>
> The first thing to try is to put the index back and run "SET
> enable_seqscan=off" before the explain analyse. That should force it to
> use the index. Then we'll see what costs it's expecting.
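
[A sketch of that test, plus the work_mem experiment mentioned further
up in the thread; in 8.1 work_mem is an integer number of kilobytes,
and the size below is only a guess:

  SET enable_seqscan = off;   -- steer the planner away from the seq scan
  SET work_mem = 131072;      -- 128 MB for this session's sorts
  EXPLAIN ANALYZE SELECT DISTINCT a, b FROM tbl;
  RESET enable_seqscan;
  RESET work_mem;
]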
>
> >> How often is the table updated? Clustering might buy you some
> >> improvements (but not a huge amount I suspect).
> >
> > It is updated once per 3-5 seconds.
>
> OK - forget clustering then.
>
> > And one more thing. I don't know if it helps, but column 'a' can have
> > value from a limited set: 0, 1 or 2. Column 'b' is also an integer
> > (foreign key, actually).
>
> Hmm - might be worth trying distinct on (b,a) with an index that way
> around - that would give you greater selectivity at the top-level of the
> btree. Can you repeat the EXPLAIN ANALYSE with that too please.
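
[A sketch of that variant; the index name is made up for illustration:

  CREATE INDEX tbl_b_a_idx ON tbl (b, a);
  EXPLAIN ANALYZE SELECT DISTINCT b, a FROM tbl;
]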
>
> --
> Richard Huxton
> Archonet Ltd
>
