Re: Querying distinct values from a large table

From: Gregory Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Gregory Stark <gsstark(at)mit(dot)edu>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Chad Wagner <chad(dot)wagner(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Igor Lobanov <ilobanov(at)swsoft(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Querying distinct values from a large table
Date: 2007-01-31 13:24:41
Message-ID: 877iv3miqu.fsf@stark.xeocode.com
Lists: pgsql-performance

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Gregory Stark wrote:
> >> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependent on
> >> your data distribution. It's not hard to come up with distributions where it's
> >> 1000x as fast and others where there's no speed difference.)
>
> > So the figure is really "1-1000x"? I bet this one is more impressive in
> > PHB terms.
>
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in. I wish he'd stop doing that...

In fairness I may have exaggerated a bit there. There is a limit to how much
of a speedup you can get in valid benchmarking situations. A single sequential
scan is always going to be necessary, so you're only saving the cost of writing
out the temporary file and the subsequent merge passes.
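
To make that concrete, here's a minimal sketch (table and column names are
invented, and it's written with GROUP BY since, as far as I know, the planner
only considers hash aggregation for GROUP BY, not for DISTINCT; which plan you
actually get depends on the estimated number of groups versus work_mem):

    -- sort-based plan: Seq Scan -> Sort (external merge if it spills) -> Group
    SET enable_hashagg = off;
    EXPLAIN ANALYZE SELECT x FROM big GROUP BY x;

    -- hash-based plan: Seq Scan -> HashAggregate, no temporary sort file
    SET enable_hashagg = on;
    EXPLAIN ANALYZE SELECT x FROM big GROUP BY x;

Both plans pay for the same sequential scan of "big"; the difference is
confined to what happens to the tuples after they're read.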

It's hard to generate lots of intermediate merge passes since there are only
O(log(n)) of them. So to get a 1000x speedup on a large I/O-bound sort you would
have to be sorting something on the order of 2^1000 records, which is ridiculous.
Realistically you should only be able to save 2-5 intermediate merge passes.
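
Back-of-the-envelope, as a sketch rather than a measurement: with N input
tuples, initial sorted runs of roughly M tuples, and k-way merging,

    p \approx \left\lceil \log_k \frac{N}{M} \right\rceil ,
    \qquad
    \frac{\text{external sort I/O}}{\text{single scan I/O}}
      \approx \frac{2N(p+1)}{N} = 2(p+1)

so the attainable speedup is linear in the number of passes p, and p grows
only logarithmically in N. A three-digit speedup would need hundreds of merge
passes, hence the absurd record counts above.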

On the other hand, there are some common situations where you could see atypical
increases. Consider joining a bunch of small tables to generate a large result
set. The small tables are probably all in memory and the result set may have
only a small number of distinct values. If you throw out the duplicates early
you save *all* the I/O. If you have to do a disk sort it could be many orders
of magnitude slower.
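
As a sketch of that shape of query (schema invented for illustration):

    -- small lookup tables that stay cached; the join result is huge
    SELECT DISTINCT c.region
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN products  p ON p.id = o.product_id;

If region takes only a dozen or so distinct values, a hash aggregate holds a
dozen entries in memory while the join output streams past; a sort-based
DISTINCT would first have to write the entire join result out to a temporary
file.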

This is actually not an uncommon coding idiom for MySQL programmers accustomed
to a fast DISTINCT, working around the lack of subqueries and the poor
performance of IN and EXISTS. They often just join all the tables together in a
big cross join and then toss in a DISTINCT at the top to get rid of the
duplicates.
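
The pattern looks something like this (hypothetical tables):

    -- join-then-DISTINCT, standing in for a semijoin
    SELECT DISTINCT o.id, o.total
    FROM orders o, order_lines l
    WHERE l.order_id = o.id AND l.sku = 'WIDGET';

    -- what it's emulating
    SELECT o.id, o.total
    FROM orders o
    WHERE EXISTS (SELECT 1 FROM order_lines l
                  WHERE l.order_id = o.id AND l.sku = 'WIDGET');

The first form emits one copy of each order per matching line item, so its
performance lives or dies by how cheaply those duplicates can be thrown away.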

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
