Re: Speeding up query, Joining 55mil and 43mil records.

From: nicky <nicky(at)valuecare(dot)nl>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Speeding up query, Joining 55mil and 43mil records.
Date: 2006-06-22 09:48:45
Message-ID: 449A677D.4020504@valuecare.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello again,

thanks for all the quick replies.

It seems i wasn't entirely correct on my previous post, i've mixed up
some times/numbers.

Below the correct numbers

MSSQL: SELECT COUNT(*) from JOIN (without insert) 17 minutes
PostgreSQL: SELECT COUNT(*) from JOIN (without insert) 33 minutes
PostgreSQL: complete query 55 minutes

The part i'm really troubled with is the difference in performance for
the select part. Which takes twice as long on PostgreSQL even though it
has a better server then MSSQL.

Changed i've made to postgressql.conf

work_mem = 524288 (1GB, results in out of memory error)
checkpoints_segments = 256
checkpoints_timeout = 3600
checkpoints_warning = 0

I've ran the complete 'explain analyse query' twice. First with
pgsql_tmp on the same disk, then again with pgsql_tmp on a seperate disk.

**** (PostgreSQL) (*pgsql_tmp on same disk*):

Hash Join (cost=1328360.12..6167462.76 rows=7197568 width=118) (actual
time=327982.425..1903423.769 rows=7551616 loops=1)
Hash Cond: (("outer".id)::text = ("inner".id)::text)
-> Seq Scan on src_faktuur_verrsec t0 (cost=0.00..2773789.90
rows=40902852 width=52) (actual time=8.935..613455.204 rows=37368390
loops=1)
Filter: ((substr((code)::text, 1, 2) <> '14'::text) AND
(substr((code)::text, 1, 2) <> '15'::text) AND (substr((code)::text, 1,
2) <> '16'::text) AND (substr((code)::text, 1, 2) <> '17'::text) AND
((substr((correctie)::text, 4, 1) <> '1'::text) OR (correctie IS NULL)))
-> Hash (cost=1188102.97..1188102.97 rows=8942863 width=80) (actual
time=327819.698..327819.698 rows=8761024 loops=1)
-> Bitmap Heap Scan on src_faktuur_verricht t1
(cost=62392.02..1188102.97 rows=8942863 width=80) (actual
time=75911.336..295510.647 rows=8761024 loops=1)
Recheck Cond: (date_part('year'::text, datum) >
2004::double precision)
-> Bitmap Index Scan on src_faktuur_verricht_idx1
(cost=0.00..62392.02 rows=8942863 width=0) (actual
time=75082.080..75082.080 rows=8761024 loops=1)
Index Cond: (date_part('year'::text, datum) >
2004::double precision)
Total runtime: 3355696.015 ms

**** (PostgreSQL) (*pgsql_tmp on seperate disk*)

Hash Join (cost=1328360.12..6167462.76 rows=7197568 width=118) (actual
time=172797.736..919869.708 rows=7551616 loops=1)
Hash Cond: (("outer".id)::text = ("inner".id)::text)
-> Seq Scan on src_faktuur_verrsec t0 (cost=0.00..2773789.90
rows=40902852 width=52) (actual time=0.015..362154.822 rows=37368390
loops=1)
Filter: ((substr((code)::text, 1, 2) <> '14'::text) AND
(substr((code)::text, 1, 2) <> '15'::text) AND (substr((code)::text, 1,
2) <> '16'::text) AND (substr((code)::text, 1, 2) <> '17'::text) AND
((substr((correctie)::text, 4, 1) <> '1'::text) OR (correctie IS NULL)))
-> Hash (cost=1188102.97..1188102.97 rows=8942863 width=80) (actual
time=172759.255..172759.255 rows=8761024 loops=1)
-> Bitmap Heap Scan on src_faktuur_verricht t1
(cost=62392.02..1188102.97 rows=8942863 width=80) (actual
time=4244.840..142144.606 rows=8761024 loops=1)
Recheck Cond: (date_part('year'::text, datum) >
2004::double precision)
-> Bitmap Index Scan on src_faktuur_verricht_idx1
(cost=0.00..62392.02 rows=8942863 width=0) (actual
time=3431.361..3431.361 rows=8761024 loops=1)
Index Cond: (date_part('year'::text, datum) >
2004::double precision)
Total runtime: 2608316.714 ms

A lot of difference in performance. 55 minutes to 42 minutes.

I've ran the 'select count(*) from JOIN' to see the difference on that
part.

**** (PostgreSQL) Explain analyse from SELECT COUNT(*) from the JOIN.
(*pgsql_tmp on seperate disk*)

Aggregate (cost=5632244.93..5632244.94 rows=1 width=0) (actual
time=631993.425..631993.427 rows=1 loops=1)
-> Hash Join (cost=1258493.12..5614251.00 rows=7197568 width=0)
(actual time=237999.277..620018.706 rows=7551616 loops=1)
Hash Cond: (("outer".id)::text = ("inner".id)::text)
-> Seq Scan on src_faktuur_verrsec t0 (cost=0.00..2773789.90
rows=40902852 width=14) (actual time=23.449..200532.422 rows=37368390
loops=1)
Filter: ((substr((code)::text, 1, 2) <> '14'::text) AND
(substr((code)::text, 1, 2) <> '15'::text) AND (substr((code)::text, 1,
2) <> '16'::text) AND (substr((code)::text, 1, 2) <> '17'::text) AND
((substr((correctie)::text, 4, 1) <> '1'::tex (..)
-> Hash (cost=1188102.97..1188102.97 rows=8942863 width=14)
(actual time=237939.262..237939.262 rows=8761024 loops=1)
-> Bitmap Heap Scan on src_faktuur_verricht t1
(cost=62392.02..1188102.97 rows=8942863 width=14) (actual
time=74713.092..216206.478 rows=8761024 loops=1)
Recheck Cond: (date_part('year'::text, datum) >
2004::double precision)
-> Bitmap Index Scan on src_faktuur_verricht_idx1
(cost=0.00..62392.02 rows=8942863 width=0) (actual
time=73892.153..73892.153 rows=8761024 loops=1)
Index Cond: (date_part('year'::text, datum) >
2004::double precision)
Total runtime: 631994.172 ms

A lot of improvement also in the select count: 33 minutes vs 10 minutes.

To us, the speeds are good. Very happy with the performance increase on
that select with join, since 90% of the queries are SELECT based.

The query results in 7551616 records, so that's about 4500 inserts per
second. I'm not sure if that is fast or not. Any further tips would be
welcome.

Thanks everyone.
Nicky

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Sven Geisler 2006-06-22 11:29:41 Re: Speeding up query, Joining 55mil and 43mil records.
Previous Message soni de 2006-06-22 04:58:11 Regarding ERROR: fmgr_info: function 2720768: cache lookup failed