Skip site navigation (1) Skip section navigation (2)

Re: Querying distinct values from a large table

From: Gregory Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Gregory Stark <gsstark(at)mit(dot)edu>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Chad Wagner <chad(dot)wagner(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Igor Lobanov <ilobanov(at)swsoft(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Querying distinct values from a large table
Date: 2007-01-31 13:24:41
Message-ID: 877iv3miqu.fsf@stark.xeocode.com (view raw or flat)
Thread:
Lists: pgsql-performance
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Gregory Stark wrote:
> >> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
> >> your data distribution. It's not hard to come up with distributions where it's
> >> 1000x as fast and others where there's no speed difference.)
> 
> > So the figure is really "1-1000x"?  I bet this one is more impressive in
> > PHB terms.
> 
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in.  I wish he'd stop doing that...

In fairness I may have exaggerated a bit there. There is a limit to how much
of a speedup you can get in valid benchmarking situations. A single sequential
scan is always going to be necessary so you're only saving the cost of writing
out the temporary file and subsequent merge passes.

It's hard to generate lots of intermediate merge passes since there are only
O(log(n)) of them. So to get 1000x speedup on a large I/O bound sort you would
have to be sorting something on order of 2^1000 records which is ridiculous.
Realistically you should only be able to save 2-5 intermediate merge passes.

On the other there are some common situations where you could see atypical
increases. Consider joining a bunch of small tables to generate a large result
set. The small tables are probably all in memory and the result set may only
have a small number of distinct values. If you throw out the duplicates early
you save *all* the I/O. If you have to do a disk sort it could be many orders
slower.

This is actually not an uncommon coding idiom for MySQL programmers accustomed
to fast DISTINCT working around the lack of subqueries and poor performance of
IN and EXISTS. They often just join together all the tables in a big cross
join and then toss in a DISTINCT at the top to get rid of the duplicates.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


In response to

Responses

pgsql-performance by date

Next:From: Sidar López CruzDate: 2007-01-31 14:19:41
Subject: Re: Very slow queries
Previous:From: Luke LonerganDate: 2007-01-31 06:58:53
Subject: Re: Querying distinct values from a large table

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group