Re: Low perfomance SUM and Group by large databse

From: "Sergio Charpinel Jr(dot)" <sergiocharpinel(at)gmail(dot)com>
To: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Low perfomance SUM and Group by large databse
Date: 2010-06-29 11:59:37
Message-ID: AANLkTin5Qo_mFZuhOqVVZFobSHYp2-lzY4Cg8YKfUxYc@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

One more question about two specifics query behavior: If I add "AND (ip_dst
= x.x.x.x)", it uses another plan and take a much more time. In both of
them, I'm using WHERE clause. Why this behavior?

=> explain analyze SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags,
ip_proto, bytes, packets, flows FROM "acct_2010_26" WHERE
"stamp_inserted">='2010-06-28 09:07' AND "stamp_inserted"<'2010-06-29 08:07'
AND (ip_dst = '8.8.8.8') ORDER BY bytes DESC LIMIT 50 OFFSET 0;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=496332.56..496332.69 rows=50 width=50) (actual
time=125390.523..125390.540 rows=50 loops=1)
-> Sort (cost=496332.56..496351.35 rows=7517 width=50) (actual
time=125390.520..125390.525 rows=50 loops=1)
Sort Key: bytes
-> Index Scan using acct_2010_26_pk on acct_2010_26
(cost=0.00..495848.62 rows=7517 width=50) (actual time=0.589..125385.680
rows=1011 loops=1)
Index Cond: ((stamp_inserted >= '2010-06-28
09:07:00'::timestamp without time zone) AND (stamp_inserted < '2010-06-29
08:07:00'::timestamp without time zone) AND (ip_dst = '8.8.8.8'::inet))
Total runtime: 125390.711 ms
(6 registros)

=> explain analyze SELECT ip_src, port_src, ip_dst, port_dst, tcp_flags,
ip_proto, bytes, packets, flows FROM "acct_2010_26" WHERE
"stamp_inserted">='2010-06-28 09:07' AND "stamp_inserted"<'2010-06-29 08:07'
ORDER BY bytes DESC LIMIT 50 OFFSET 0;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..268.25 rows=50 width=50) (actual time=0.150..70.780
rows=50 loops=1)
-> Index Scan Backward using ibytes_acct_2010_26 on acct_2010_26
(cost=0.00..133240575.70 rows=24835384 width=50) (actual time=0.149..70.762
rows=50 loops=1)
Filter: ((stamp_inserted >= '2010-06-28 09:07:00'::timestamp
without time zone) AND (stamp_inserted < '2010-06-29 08:07:00'::timestamp
without time zone))
Total runtime: 70.830 ms
(4 registros)

Thanks in advance.

2010/6/23 Sergio Charpinel Jr. <sergiocharpinel(at)gmail(dot)com>

> Craig, Russel,
>
> I appreciate your help.
>
> Thanks.
>
> 2010/6/22 Russell Smith <mr-russ(at)pws(dot)com(dot)au>
>
> On 22/06/10 00:42, Sergio Charpinel Jr. wrote:
>> > Hi,
>> >
>> [snip]
>> >
>> > => explain analyze SELECT ip_src, port_src, ip_dst, port_dst,
>> > tcp_flags, ip_proto,SUM("bytes"),SUM("packets"),SUM("flows") FROM
>> > "acct_2010_25" WHERE "stamp_inserted">='2010-06-20 10:10' AND
>> > "stamp_inserted"<'2010-06-21 10:10' GROUP BY ip_src, port_src, ip_dst,
>> > port_dst, tcp_flags, ip_proto order by SUM(bytes) desc LIMIT 50 OFFSET
>> 0;
>> >
>> > QUERY PLAN
>> >
>> >
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> > Limit (cost=3998662.81..3998662.94 rows=50 width=50) (actual
>> > time=276981.107..276981.133 rows=50 loops=1)
>> > -> Sort (cost=3998662.81..4001046.07 rows=953305 width=50)
>> > (actual time=276981.105..276981.107 rows=50 loops=1)
>> > Sort Key: sum(bytes)
>> > -> GroupAggregate (cost=3499863.27..3754872.33 rows=953305
>> > width=50) (actual time=165468.257..182677.580 rows=8182616 loops=1)
>> > -> Sort (cost=3499863.27..3523695.89 rows=9533049
>> > width=50) (actual time=165468.022..168908.828 rows=9494165 loops=1)
>> > Sort Key: ip_src, port_src, ip_dst, port_dst,
>> > tcp_flags, ip_proto
>>
>> You are having to sort and aggregate a large number of rows before you
>> can get the top 50. That's 9 million rows in this case, width 50 =
>> 400MB+ sort. That's going to be slow as you are going to have to sort
>> it on disk unless you bump up sort mem to 500Mb (bad idea). So unless
>> you have really fast storage for temporary tables it's going to take a
>> while. About 2.5 minutes you are experiencing at the moment is probably
>> not too bad.
>>
>> I'm sure improvements have been made in the area since 8.1 and if you
>> are able to upgrade to 8.4 which is also offered by Centos5 now, you
>> might get benefit there. I can't remember the specific benefits, but I
>> believe sorting speed has improved, your explain analyze will also give
>> you more information about what's going on with disk/memory sorting.
>>
>> > -> Seq Scan on acct_2010_25
>> > (cost=0.00..352648.10 rows=9533049 width=50) (actual
>> > time=0.038..50860.391 rows=9494165 loops=1)
>> > Filter: ((stamp_inserted >= '2010-06-20
>> > 10:10:00'::timestamp without time zone) AND (stamp_inserted <
>> > '2010-06-21 10:10:00'::timestamp without time zone))
>> > Total runtime: 278791.661 ms
>> > (9 registros)
>> >
>> > Another one just summing bytes (still low):
>> >
>> > => explain analyze SELECT ip_src, port_src, ip_dst, port_dst,
>> > tcp_flags, ip_proto,SUM("bytes") FROM "acct_2010_25" WHERE
>> > "stamp_inserted">='2010-06-20 10:10' AND "stamp_inserted"<'2010-06-21
>> > 10:10' GROUP BY ip_src, port_src, ip_dst, port_dst, tcp_flags,
>> > ip_proto LIMIT 50 OFFSET 0;
>> >
>> > QUERY PLAN
>> >
>> >
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> > Limit (cost=3395202.50..3395213.12 rows=50 width=42) (actual
>> > time=106261.359..106261.451 rows=50 loops=1)
>> > -> GroupAggregate (cost=3395202.50..3602225.48 rows=974226
>> > width=42) (actual time=106261.357..106261.435 rows=50 loops=1)
>> > -> Sort (cost=3395202.50..3419558.14 rows=9742258 width=42)
>> > (actual time=106261.107..106261.169 rows=176 loops=1)
>> > Sort Key: ip_src, port_src, ip_dst, port_dst,
>> > tcp_flags, ip_proto
>> > -> Seq Scan on acct_2010_25 (cost=0.00..367529.72
>> > rows=9742258 width=42) (actual time=0.073..8058.598 rows=9494165
>> loops=1)
>> > Filter: ((stamp_inserted >= '2010-06-20
>> > 10:10:00'::timestamp without time zone) AND (stamp_inserted <
>> > '2010-06-21 10:10:00'::timestamp without time zone))
>> > Total runtime: 109911.882 ms
>> > (7 registros)
>> >
>> >
>> > The server has 2 Intel(R) Xeon(R) CPU E5430 @ 2.66GHz and 16GB RAM.
>> > I'm using PostgreSQL 8.1.18 default config from Centos 5.5 (just
>> > increased checkpoint_segments to 50).
>>
>> Checkpoint segments won't help you as the number of segments is about
>> writing to the database and how fast that can happen.
>>
>> >
>> > What can I change to increase performance?
>>
>> Increasing sort-memory (work_mem) will give you speed benefits even
>> though you are going to disk. I don't know how much spare memory you
>> have, but trying other values between 8MB and 128MB may be useful just
>> for the specific query runs. If you can afford 512Mb for each of the
>> two sorts, go for that, but it's dangerous as mentioned due to the risk
>> of using more RAM than you have. work_mem allocates that amount of
>> memory per sort.
>>
>> If you are running these queries all the time, a summary table the
>> produces there reports on a regular basis, maybe daily or even hourly
>> would be useful. Basically the large amount of information that needs
>> to be processed and sorted is what's taking all the time here.
>>
>> Regards
>>
>> Russell
>>
>
>
>
> --
> Sergio Roberto Charpinel Jr.
>

--
Sergio Roberto Charpinel Jr.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Merlin Moncure 2010-06-29 13:18:14 Re: pgbench results on a new server
Previous Message Robert Haas 2010-06-29 10:52:10 Re: PostgreSQL as a local in-memory cache