Re: Slow query with a lot of data

From: Moritz Onken <onken(at)houseofdesign(dot)de>
To:
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Slow query with a lot of data
Date: 2008-08-21 07:03:53
Message-ID: 7DE7116C-4BBC-4776-AEE0-63A2E2DEAF16@houseofdesign.de
Lists: pgsql-performance


On 20 Aug 2008, at 20:06, Scott Carey wrote:

> Ok, so the problem boils down to the sort at the end.
>
> The query up through the merge join on domain is as fast as it's
> going to get. The sort at the end, however, should ideally not
> happen: there are not that many rows returned, and it should
> hash-aggregate if it thinks there is enough space to do so.
>
> The query planner will choose the sort-then-aggregate plan over the
> hash aggregate if it estimates that the result set is too large for
> the hash table to fit in work_mem. However, there seems to be
> another factor here, based on this:
>
>
> "GroupAggregate  (cost=11745105.66..12277396.81 rows=28704 width=12)"
> "  ->  Sort  (cost=11745105.66..11878034.93 rows=53171707 width=12)"
> "        Sort Key: a."user", b.category"
> "        ->  Merge Join  (cost=149241.25..1287278.89 rows=53171707 width=12)"
> "              Merge Cond: (b.domain = a.domain)"
>
>
> The planner actually thinks there will only be 28704 rows returned
> of width 12. But it chooses to sort 53 million rows before
> aggregating. That's either a bug or there's something else wrong
> here. That is the wrong way to aggregate those results no matter
> how much work_mem you have, unless I'm completely missing something...
>
> You can try rearranging the query just to see if you can work around
> this. What happens if you compare the explain on:
>
> select
> a."user", b.category, sum(1.0/b.cat_count)::float
> from result a, domain_categories b
> where a."domain" = b."domain"
> and b.depth < 4
> and a.results > 100
> and a."user" < 30000
> group by a."user", b.category
>
>

"HashAggregate (cost=1685527.69..1686101.77 rows=28704 width=12)"
" -> Merge Join (cost=148702.25..1286739.89 rows=53171707 width=12)"
" Merge Cond: (b.domain = a.domain)"
" -> Index Scan using domain_categories_domain on
domain_categories b (cost=0.00..421716.32 rows=5112568 width=12)"
" Filter: (depth < 4)"
" -> Sort (cost=148415.16..148513.60 rows=39376 width=8)"
" Sort Key: a.domain"
" -> Bitmap Heap Scan on result a
(cost=1249.93..145409.79 rows=39376 width=8)"
" Recheck Cond: ("user" < 30000)"
" Filter: (results > 100)"
" -> Bitmap Index Scan on result_user_idx
(cost=0.00..1240.08 rows=66881 width=0)"
" Index Cond: ("user" < 30000)"

> to
>
> select
> c."user", c.category, sum(1.0/c.cat_count)::float
> from (select a."user", b.category, b.cat_count
> from result a, domain_categories b
> where a."domain" = b."domain"
> and b.depth < 4
> and a.results > 100
> and a."user" < 30000 ) c
> group by c."user", c.category
>

"HashAggregate (cost=1685527.69..1686101.77 rows=28704 width=12)"
" -> Merge Join (cost=148702.25..1286739.89 rows=53171707 width=12)"
" Merge Cond: (b.domain = a.domain)"
" -> Index Scan using domain_categories_domain on
domain_categories b (cost=0.00..421716.32 rows=5112568 width=12)"
" Filter: (depth < 4)"
" -> Sort (cost=148415.16..148513.60 rows=39376 width=8)"
" Sort Key: a.domain"
" -> Bitmap Heap Scan on result a
(cost=1249.93..145409.79 rows=39376 width=8)"
" Recheck Cond: ("user" < 30000)"
" Filter: (results > 100)"
" -> Bitmap Index Scan on result_user_idx
(cost=0.00..1240.08 rows=66881 width=0)"
" Index Cond: ("user" < 30000)"

> It shouldn't make a difference, but I've seen things like this help
> before so it's worth a try. Make sure work_mem is reasonably sized
> for this test.

The plan is exactly the same for both variants. work_mem was set to 3000MB.
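
A session-level SET before running the EXPLAIN is enough to make sure the
planner actually sees that value, e.g.:

SET work_mem = '3000MB';  -- per-session setting; the planner can then consider the hash aggregate
SHOW work_mem;            -- confirms the value the current session actually uses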

>
>
> Another thing that won't be that fast, but may avoid the sort, is to
> select the subselection above into a temporary table, analyze it,
> and then do the outer select. Make sure your settings for temporary
> space (temp_buffers in 8.3) are large enough for the intermediate
> results (700MB should do it). That won't be that fast, but it will
> most likely be faster than sorting 50+ million rows. There are
> lots of problems with this approach but it may be worth the
> experiment.
>

I'll try this.
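
Probably something along these lines (just a rough sketch; the temp table
name is made up, and temp_buffers is sized after your 700MB estimate):

SET temp_buffers = '700MB';        -- must be set before the session touches any temp table

CREATE TEMPORARY TABLE user_cat AS -- hypothetical name for the intermediate result
  SELECT a."user", b.category, b.cat_count
  FROM result a, domain_categories b
  WHERE a."domain" = b."domain"
    AND b.depth < 4
    AND a.results > 100
    AND a."user" < 30000;

ANALYZE user_cat;                  -- give the planner statistics on the intermediate table

SELECT "user", category, sum(1.0/cat_count)::float
FROM user_cat
GROUP BY "user", category;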

Thanks so far!
