Re: Querying distinct values from a large table

From: Ron <rjpeace(at)earthlink(dot)net>
To: Gregory Stark <gsstark(at)mit(dot)edu>,Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Querying distinct values from a large table
Date: 2007-01-31 14:46:30
Message-ID: E1HCGjN-0003Gp-GQ@elasmtp-kukur.atl.sa.earthlink.net
Lists: pgsql-performance

I strongly encourage anyone interested in the general external-sorting
problem to peruse Jim Gray's site:
http://research.microsoft.com/barc/SortBenchmark/

Ron Peacetree

At 08:24 AM 1/31/2007, Gregory Stark wrote:
>Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > > Gregory Stark wrote:
> > >> (Incidentally I'm not sure where 2-5x comes from. It's entirely
> > >> dependent on your data distribution. It's not hard to come up with
> > >> distributions where it's 1000x as fast and others where there's no
> > >> speed difference.)
> >
> > > So the figure is really "1-1000x"? I bet this one is more impressive in
> > > PHB terms.
> >
> > Luke has a bad habit of quoting numbers that are obviously derived from
> > narrow benchmarking scenarios as Universal Truths, rather than providing
> > the context they were derived in. I wish he'd stop doing that...
>
>In fairness I may have exaggerated a bit there. There is a limit to how much
>of a speedup you can get in valid benchmarking situations. A single sequential
>scan is always going to be necessary, so you're only saving the cost of
>writing out the temporary file and the subsequent merge passes.
>
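For illustration (the table and column names here are invented, not from the
thread): one way to see the sort-versus-hash difference is to toggle hash
aggregation and compare plans, using GROUP BY as the duplicate-eliminating
form since it can use a hash aggregate:

  -- Hypothetical table: orders(customer_id, ...)
  EXPLAIN ANALYZE SELECT customer_id FROM orders GROUP BY customer_id;
  -- Typically a HashAggregate over a single Seq Scan: no temp file,
  -- no merge passes.

  SET enable_hashagg = off;
  EXPLAIN ANALYZE SELECT customer_id FROM orders GROUP BY customer_id;
  -- Now a GroupAggregate over a Sort, which spills to disk once the
  -- input no longer fits in work_mem.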
>It's hard to generate lots of intermediate merge passes since there are only
>O(log(n)) of them. So to get a 1000x speedup on a large I/O-bound sort you
>would have to be sorting something on the order of 2^1000 records, which is
>ridiculous. Realistically you should only be able to save 2-5 intermediate
>merge passes.
>
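To put rough numbers on that (illustrative assumptions, not measurements): an
external sort that starts with R sorted runs and merges F of them at a time
needs ceil(log_F(R)) merge passes. For example:

  R = 1024 initial runs   (say, ~1 TB of data with ~1 GB of sort memory)
  F = 256  runs merged per pass
  passes = ceil(log_256(1024)) = ceil(10/8) = 2

Even a terabyte-scale sort pays for only a couple of merge passes, so
eliminating them buys a small constant factor, not orders of magnitude.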
>On the other hand, there are some common situations where you could see
>atypical increases. Consider joining a bunch of small tables to generate a
>large result set. The small tables are probably all in memory and the result
>set may only have a small number of distinct values. If you throw out the
>duplicates early you save *all* the I/O. If you have to do a disk sort it
>could be many orders of magnitude slower.
>
>This is actually not an uncommon coding idiom for MySQL programmers accustomed
>to fast DISTINCT working around the lack of subqueries and poor performance of
>IN and EXISTS. They often just join together all the tables in a big cross
>join and then toss in a DISTINCT at the top to get rid of the duplicates.
>
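A sketch of that idiom and one common rewrite (the schema is hypothetical):

  -- Join everything, then deduplicate at the top:
  SELECT DISTINCT u.id, u.name
  FROM users u, orders o, items i
  WHERE o.user_id = u.id
    AND i.order_id = o.id;

  -- The same intent with EXISTS, which never materializes the blown-up
  -- intermediate result (assuming users.id is the primary key):
  SELECT u.id, u.name
  FROM users u
  WHERE EXISTS (SELECT 1
                FROM orders o
                JOIN items i ON i.order_id = o.id
                WHERE o.user_id = u.id);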
>--
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
