Quick Links

Performance hit on large row counts

From:	David Scott <davids(at)apptechsys(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Performance hit on large row counts
Date:	2005-12-26 21:03:34
Message-ID:	43B05AA6.9020902@apptechsys.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

We are trying to ascertain if we are up against the limits of what
postgres can accomplish without having the tables clustered. We would
prefer not to have to cluster our tables because according to the
documentation this is a one time operation and not maintained. Is there
any other performance tweaks that can be done to avoid clustering? Is
there any way to force the cluster maintenance (even with a performance
hit on load)?
We are aware that there is a minimum time that is required to resolve
the index values against the table to ascertain that they are live rows,
and we believe we are circumventing that time to some extent by taking
advantage of the rows being in physical order with the cluster. So does
this lead us to the conclusion that the differences in the query times
is how long is takes us to check on disk whether or not these rows are live?

Thanks for any help, thoughts, tips or suggestions.

All of these commands are after a vacuum full analyze and the config
file is attached. Different values were used for the queries so no
caching would confuse our stats. The box is running gentoo with
postgres 8.1.0, has raid 0, 9 gigs of ram, 2 hyperthreaded procs, x86_64.

/Three tables with row counts:/
lookup1.count = 3,306,930
lookup2.count = 4,189,734
stuff.count = 3,423,994

/The first attempt (after index adjustments, no hits to cached results)/

explain analyze select col2, count(*) as cnt from stuff where col1 =
56984 group by col2

HashAggregate (cost=14605.68..14605.88 rows=16 width=4) (actual
time=6980.752..6985.893 rows=6389 loops=1)
-> Bitmap Heap Scan on stuff (cost=60.97..14571.44 rows=6848
width=4) (actual time=371.215..6965.742 rows=6389 loops=1)
Recheck Cond: (col1 = 56984)
-> Bitmap Index Scan on stuff_pair_idx (cost=0.00..60.97
rows=6848 width=0) (actual time=361.237..361.237 rows=6389 loops=1)
Index Cond: (col1 = 56984)
Total runtime: 6988.105 ms

/After clustering:/

explain analyze select col2, count(*) as cnt from stuff where col1 =
3540634 group by col2;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1399.62..1399.63 rows=1 width=4) (actual
time=11.376..15.282 rows=5587 loops=1)
-> Bitmap Heap Scan on stuff (cost=4.36..1397.68 rows=389 width=4)
(actual time=1.029..4.538 rows=5587 loops=1)
Recheck Cond: (col1 = 3540634)
-> Bitmap Index Scan on stuff_col1_idx (cost=0.00..4.36
rows=389 width=0) (actual time=1.003..1.003 rows=5587 loops=1)
Index Cond: (col1 = 3540634)
Total runtime: 17.113 ms

/Using this in the next layer of querying:/

explain analyze SELECT col1,col2, value AS val,
coalesce(coalesce(lookup1.col3, lookup2.col3),0) AS dollars FROM (select
col1, col2, value from stuff where col1 = 95350) stuff LEFT JOIN
lookup1 ON (stuff.col2 = lookup1.pkey) LEFT JOIN lookup2 ON (stuff.col2
= lookup2.pkey);

Nested Loop Left Join (cost=0.00..10325.15 rows=857 width=20) (actual
time=84.223..9306.228 rows=2296 loops=1)
-> Nested Loop Left Join (cost=0.00..5183.25 rows=857 width=16)
(actual time=56.623..1710.655 rows=2296 loops=1)
-> Index Scan using stuff_col1_idx on stuff (cost=0.00..21.57
rows=857 width=12) (actual time=40.531..57.160 rows=2296 loops=1)
Index Cond: (col1 = 4528383)
-> Index Scan using lookup2_pkey on lookup2 (cost=0.00..6.01
rows=1 width=8) (actual time=0.717..0.717 rows=0 loops=2296)
Index Cond: ("outer".col2 = lookup2.pkey)
-> Index Scan using lookup1_pkey on lookup1 (cost=0.00..5.99 rows=1
width=8) (actual time=3.304..3.305 rows=1 loops=2296)
Index Cond: ("outer".col2 = lookup1.pkey)
Total runtime: 9307.569 ms

/After clustering the two left join tables (lookup1 and lookup2):/

Nested Loop Left Join (cost=0.00..10325.15 rows=857 width=20) (actual
time=24.444..84.114 rows=1727 loops=1)
-> Nested Loop Left Join (cost=0.00..5163.47 rows=857 width=16)
(actual time=24.392..62.787 rows=1727 loops=1)
-> Index Scan using stuff_col1_idx on stuff (cost=0.00..21.57
rows=857 width=12) (actual time=24.332..27.455 rows=1727 loops=1)
Index Cond: (col1 = 95350)
-> Index Scan using lookup1_pkey on lookup1 (cost=0.00..5.99
rows=1 width=8) (actual time=0.018..0.018 rows=1 loops=1727)
Index Cond: ("outer".col2 = lookup1.pkey)
-> Index Scan using lookup2_pkey on lookup2 (cost=0.00..6.01 rows=1
width=8) (actual time=0.010..0.010 rows=0 loops=1727)
Index Cond: ("outer".col2 = lookup2.pkey)
Total runtime: 84.860 ms

Attachment	Content-Type	Size
postgresql.conf	text/plain	13.3 KB

Responses

Re: Performance hit on large row counts at 2005-12-26 21:36:30 from Tom Lane

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Ivan Voras	2005-12-26 21:32:13	Bitmap indexes etc.
Previous Message	Luke Lonergan	2005-12-26 18:50:34	Re: What's the best hardver for PostgreSQL 8.1?