
Re: Low performance SUM and Group by large database

From: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
To: "Sergio Charpinel Jr(dot)" <sergiocharpinel(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Low performance SUM and Group by large database
Date: 2010-06-22 09:45:19
Message-ID: 4C20862F.10706@pws.com.au
Lists: pgsql-performance
On 22/06/10 00:42, Sergio Charpinel Jr. wrote:
> Hi,
>
[snip]
>
> => explain analyze SELECT ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto,SUM("bytes"),SUM("packets"),SUM("flows") FROM
> "acct_2010_25" WHERE "stamp_inserted">='2010-06-20 10:10' AND
> "stamp_inserted"<'2010-06-21 10:10' GROUP BY ip_src, port_src, ip_dst,
> port_dst, tcp_flags, ip_proto order by SUM(bytes) desc LIMIT 50 OFFSET 0;
>                                      QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=3998662.81..3998662.94 rows=50 width=50) (actual
> time=276981.107..276981.133 rows=50 loops=1)
>    ->  Sort  (cost=3998662.81..4001046.07 rows=953305 width=50)
> (actual time=276981.105..276981.107 rows=50 loops=1)
>          Sort Key: sum(bytes)
>          ->  GroupAggregate  (cost=3499863.27..3754872.33 rows=953305
> width=50) (actual time=165468.257..182677.580 rows=8182616 loops=1)
>                ->  Sort  (cost=3499863.27..3523695.89 rows=9533049
> width=50) (actual time=165468.022..168908.828 rows=9494165 loops=1)
>                      Sort Key: ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto

You are having to sort and aggregate a large number of rows before you
can get the top 50.  That's about 9.5 million rows in this case, at
width 50, which works out to a 450MB+ sort.  That's going to be slow,
because the sort will spill to disk unless you bump sort memory up to
around 500MB (a bad idea).  So unless you have really fast storage for
temporary files it's going to take a while; the roughly 4.5 minutes of
total runtime you are experiencing at the moment is probably not too bad.

I'm sure improvements have been made in this area since 8.1, so if you
are able to upgrade to 8.4, which is also offered by CentOS 5 now, you
might see a benefit there.  I can't remember the specific changes, but I
believe sorting speed has improved, and EXPLAIN ANALYZE on newer
versions will also tell you whether each sort ran on disk or in memory.
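For reference, from 8.3 onward the plan output carries a "Sort Method"
line for each sort node.  Something like the following (the figures
shown are made up for illustration, not taken from your plan):

```sql
-- EXPLAIN ANALYZE on 8.3+ annotates each Sort node, e.g.:
--   Sort Method:  external merge  Disk: 463912kB     -- spilled to disk
-- versus, when work_mem is large enough:
--   Sort Method:  quicksort  Memory: 102403kB        -- stayed in RAM
EXPLAIN ANALYZE
SELECT ip_src, port_src, SUM(bytes)
FROM acct_2010_25
GROUP BY ip_src, port_src;
```

That makes it much easier to see whether raising work_mem actually
changed the sort behaviour.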

>                      ->  Seq Scan on acct_2010_25
>  (cost=0.00..352648.10 rows=9533049 width=50) (actual
> time=0.038..50860.391 rows=9494165 loops=1)
>                            Filter: ((stamp_inserted >= '2010-06-20
> 10:10:00'::timestamp without time zone) AND (stamp_inserted <
> '2010-06-21 10:10:00'::timestamp without time zone))
>  Total runtime: 278791.661 ms
> (9 rows)
>
> Another one just summing bytes (still low):
>
> => explain analyze SELECT ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto,SUM("bytes") FROM "acct_2010_25" WHERE
> "stamp_inserted">='2010-06-20 10:10' AND "stamp_inserted"<'2010-06-21
> 10:10' GROUP BY ip_src, port_src, ip_dst, port_dst, tcp_flags,
> ip_proto LIMIT 50 OFFSET 0;
>                                      QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=3395202.50..3395213.12 rows=50 width=42) (actual
> time=106261.359..106261.451 rows=50 loops=1)
>    ->  GroupAggregate  (cost=3395202.50..3602225.48 rows=974226
> width=42) (actual time=106261.357..106261.435 rows=50 loops=1)
>          ->  Sort  (cost=3395202.50..3419558.14 rows=9742258 width=42)
> (actual time=106261.107..106261.169 rows=176 loops=1)
>                Sort Key: ip_src, port_src, ip_dst, port_dst,
> tcp_flags, ip_proto
>                ->  Seq Scan on acct_2010_25  (cost=0.00..367529.72
> rows=9742258 width=42) (actual time=0.073..8058.598 rows=9494165 loops=1)
>                      Filter: ((stamp_inserted >= '2010-06-20
> 10:10:00'::timestamp without time zone) AND (stamp_inserted <
> '2010-06-21 10:10:00'::timestamp without time zone))
>  Total runtime: 109911.882 ms
> (7 rows)
>
>
> The server has 2 Intel(R) Xeon(R) CPU  E5430 @ 2.66GHz and 16GB RAM.
> I'm using PostgreSQL 8.1.18 default config from Centos 5.5 (just
> increased checkpoint_segments to 50).

Checkpoint segments won't help you here: the number of segments only
affects how fast writes to the database can happen, not how fast reads
and sorts run.

>
> What can I change to increase performance?

Increasing sort memory (work_mem) will give you a speed benefit even if
the sort still goes to disk.  I don't know how much spare memory you
have, but trying values between 8MB and 128MB may be worthwhile just for
these specific query runs.  If you can afford 512MB for each of the two
sorts, go for that, but it's dangerous as mentioned, because work_mem is
allocated per sort operation, so several concurrent sorts can easily use
more RAM than you have.
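One safe way to experiment is to raise work_mem for a single session
only, so the global configuration stays untouched (the value below is an
example to try, not a recommendation):

```sql
-- Per-session override; other connections keep the configured default.
SET work_mem = '128MB';

-- Re-run the slow query here and compare timings:
EXPLAIN ANALYZE
SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto,
       SUM(bytes), SUM(packets), SUM(flows)
FROM acct_2010_25
WHERE stamp_inserted >= '2010-06-20 10:10'
  AND stamp_inserted <  '2010-06-21 10:10'
GROUP BY ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto
ORDER BY SUM(bytes) DESC LIMIT 50;

RESET work_mem;
```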

If you are running these queries all the time, a summary table that
produces these reports on a regular basis, maybe daily or even hourly,
would be useful.  Basically, the large amount of information that has to
be read and sorted is what's taking all the time here.
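As a sketch of what I mean (table and column names here are my own
invention, adjust to your schema):

```sql
-- Hourly pre-aggregation; re-run periodically (e.g. from cron),
-- inserting only the hours not yet summarised.
CREATE TABLE acct_2010_25_hourly AS
SELECT date_trunc('hour', stamp_inserted) AS stamp_hour,
       ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto,
       SUM(bytes)   AS bytes,
       SUM(packets) AS packets,
       SUM(flows)   AS flows
FROM acct_2010_25
GROUP BY date_trunc('hour', stamp_inserted),
         ip_src, port_src, ip_dst, port_dst, tcp_flags, ip_proto;
```

The top-50 report then only has to aggregate the hourly rows for the
window you care about, which should be a small fraction of the 9.5
million raw rows.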

Regards

Russell
