Re: Querying distinct values from a large table

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "Gregory Stark" <gsstark(at)mit(dot)edu>, "Chad Wagner" <chad(dot)wagner(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Igor Lobanov" <ilobanov(at)swsoft(dot)com>, "Richard Huxton" <dev(at)archonet(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Querying distinct values from a large table
Date: 2007-01-31 06:58:53
Message-ID: C1E57E2D.19B30%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Tom,

On 1/30/07 9:55 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Gregory Stark wrote:
>>> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
>>> your data distribution. It's not hard to come up with distributions where
>>> it's
>>> 1000x as fast and others where there's no speed difference.)
>
>> So the figure is really "1-1000x"? I bet this one is more impressive in
>> PHB terms.
>
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in. I wish he'd stop doing that...

In this case I was referring to results obtained using grouping and distinct
optimizations within sort where the benefit is from the use of a single pass
instead of the multiple merge passes for external sort followed by a UNIQUE
operator. In this case, the benefit ranges from 2-5x in many examples as I
mentioned: "from 2-5 times faster than a typical external sort". This is
also the same range of benefits we see for this optimization with a popular
commercial database. With the limit/sort optimization we have seen more
dramatic results, but I think those are less typical cases.

Here are some results for a 1GB table and a simple COUNT(DISTINCT) on a
column with 7 unique values from my dual CPU laptop running Greenplum DB (PG
8.2.1 compatible) on both CPUs. Note that my laptop has 2GB of RAM so I have
the 1GB table loaded into OS I/O cache. The unmodified external sort spills
the sorted attribute to disk, but that takes little time. Note that the
COUNT(DISTINCT) plan embeds a sort as the transition function in the
aggregation node.

=================================================================
================= No Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# select count(distinct l_shipmode) from lineitem;
count
-------
7
(1 row)

Time: 37832.308 ms

lukelonergan=# explain analyze select count(distinct l_shipmode) from
lineitem;
QUERY PLAN
----------------------------------------------------------------------------
----------------------------------------------------------------------------
----
Aggregate (cost=159175.30..159175.31 rows=1 width=8)
Total 1 rows with 40899 ms to end, start offset by 3.189 ms.
-> Gather Motion 2:1 (slice2) (cost=159175.25..159175.28 rows=1
width=8)
recv: Total 2 rows with 39387 ms to first row, 40899 ms to end,
start offset by 3.191 ms.
-> Aggregate (cost=159175.25..159175.26 rows=1 width=8)
Avg 1.00 rows x 2 workers. Max 1 rows (seg0) with 39367 ms
to end, start offset by 22 ms.
-> Redistribute Motion 2:2 (slice1) (cost=0.00..151672.00
rows=3001300 width=8)
recv: Avg 3000607.50 rows x 2 workers. Max 3429492
rows (seg1) with 0.362 ms to first row, 8643 ms to end, start offset by 23
ms.
Hash Key: lineitem.l_shipmode
-> Seq Scan on lineitem (cost=0.00..91646.00
rows=3001300 width=8)
Avg 3000607.50 rows x 2 workers. Max 3001300
rows (seg0) with 0.049 ms to first row, 2813 ms to end, start offset by
12.998 ms.
Total runtime: 40903.321 ms
(12 rows)

Time: 40904.013 ms

=================================================================
================= With Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# set mpp_sort_flags=1;
SET
Time: 1.425 ms
lukelonergan=# select count(distinct l_shipmode) from lineitem;
count
-------
7
(1 row)

Time: 12846.466 ms

lukelonergan=# explain analyze select count(distinct l_shipmode) from
lineitem;

QUERY PLAN
----------------------------------------------------------------------------
----------------------------------------------------------------------------
----
Aggregate (cost=159175.30..159175.31 rows=1 width=8)
Total 1 rows with 13754 ms to end, start offset by 2.998 ms.
-> Gather Motion 2:1 (slice2) (cost=159175.25..159175.28 rows=1
width=8)
recv: Total 2 rows with 13754 ms to end, start offset by 3.000 ms.
-> Aggregate (cost=159175.25..159175.26 rows=1 width=8)
Avg 1.00 rows x 2 workers. Max 1 rows (seg0) with 13734 ms
to end, start offset by 23 ms.
-> Redistribute Motion 2:2 (slice1) (cost=0.00..151672.00
rows=3001300 width=8)
recv: Avg 3000607.50 rows x 2 workers. Max 3429492
rows (seg1) with 0.352 ms to first row, 10145 ms to end, start offset by 26
ms.
Hash Key: lineitem.l_shipmode
-> Seq Scan on lineitem (cost=0.00..91646.00
rows=3001300 width=8)
Avg 3000607.50 rows x 2 workers. Max 3001300
rows (seg0) with 0.032 ms to first row, 4048 ms to end, start offset by
13.037 ms.
Total runtime: 13757.524 ms
(12 rows)

Time: 13758.182 ms

================= Background Information ==============
lukelonergan=# select count(*) from lineitem;
count
---------
6001215
(1 row)

Time: 1661.337 ms
lukelonergan=# \d lineitem
Table "public.lineitem"
Column | Type | Modifiers
-----------------+-----------------------+-----------
l_orderkey | integer | not null
l_partkey | integer | not null
l_suppkey | integer | not null
l_linenumber | integer | not null
l_quantity | double precision | not null
l_extendedprice | double precision | not null
l_discount | double precision | not null
l_tax | double precision | not null
l_returnflag | text | not null
l_linestatus | text | not null
l_shipdate | date | not null
l_commitdate | date | not null
l_receiptdate | date | not null
l_shipinstruct | text | not null
l_shipmode | text | not null
l_comment | character varying(44) | not null
Distributed by: (l_orderkey)

lukelonergan=# select pg_relation_size(oid)/(1000.*1000.) as MB, relname as
Table from pg_class order by MB desc limit 10;
mb | table
------------------------+--------------------------------
1009.4755840000000000 | lineitem
230.0559360000000000 | orders
146.7678720000000000 | partsupp
35.8072320000000000 | part
32.8908800000000000 | customer
1.9333120000000000 | supplier
1.1304960000000000 | pg_proc
0.88473600000000000000 | pg_proc_proname_args_nsp_index
0.81100800000000000000 | pg_attribute
0.81100800000000000000 | pg_depend

- Luke

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Gregory Stark 2007-01-31 13:24:41 Re: Querying distinct values from a large table
Previous Message Tom Lane 2007-01-31 05:55:19 Re: Querying distinct values from a large table