Re: Slow query with a lot of data

From: "Scott Carey" <scott(at)richrelevance(dot)com>
To: "Moritz Onken" <onken(at)houseofdesign(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Slow query with a lot of data
Date: 2008-08-20 18:06:06
Message-ID: a1ec7d000808201106j4e901d42q3a296ca1d84c2f9c@mail.gmail.com

Ok, so the problem boils down to the sort at the end.

The query up through the merge join on domain is as fast as it's going to
get. Ideally, though, the sort at the end should not happen at all. There are
not that many rows returned, and the planner should use a hash aggregate if
it thinks there is enough space to do so.

The query planner will choose the sort-then-aggregate plan over the hash
aggregate if it estimates that the total number of resulting rows is too
large for the hash to fit in work_mem. However, there seems to be another
factor here, based on this:

"GroupAggregate  (cost=11745105.66..12277396.81 rows=28704 width=12)"
"  ->  Sort  (cost=11745105.66..11878034.93 rows=53171707 width=12)"
"        Sort Key: a."user", b.category"
"        ->  Merge Join  (cost=149241.25..1287278.89 rows=53171707 width=12)"
"              Merge Cond: (b.domain = a.domain)"

The planner actually thinks there will only be 28704 rows returned, of width
12, yet it chooses to sort 53 million rows before aggregating. That's either
a bug or there's something else wrong here. That is the wrong way to
aggregate those results no matter how much work_mem you have, unless I'm
completely missing something...
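
One way to check whether the planner will even consider a hash aggregate
(this is just a diagnostic sketch; the 512MB below is an arbitrary guess,
not a recommendation) is to discourage the sort for a single session and
look at the resulting plan:

-- Diagnostic only: make sort-based plans look expensive so EXPLAIN
-- reveals whether a HashAggregate would be chosen instead.
SET enable_sort = off;
SET work_mem = '512MB';   -- illustrative value, size it to your machine

explain select a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
  and b.depth < 4
  and a.results > 100
  and a."user" < 30000
group by a."user", b.category;

-- Put things back afterwards.
RESET enable_sort;
RESET work_mem;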

You can try rearranging the query just to see if you can work around this.
What happens if you compare the EXPLAIN output of:

select a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
and b.depth < 4
and a.results > 100
and a."user" < 30000
group by a."user", b.category

to

select c."user", c.category, sum(1.0/c.cat_count)::float
from (select a."user", b.category, b.cat_count
from result a, domain_categories b
where a."domain" = b."domain"
and b.depth < 4
and a.results > 100
and a."user" < 30000 ) c
group by c."user", c.category

It shouldn't make a difference, but I've seen things like this help before,
so it's worth a try. Make sure work_mem is reasonably sized for this test.
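
For example (the value here is only a placeholder, not a recommendation):

SHOW work_mem;            -- check what the session currently has
SET work_mem = '256MB';   -- illustrative value; per-session, applies to each sort/hash
-- then run EXPLAIN on both forms of the query above and compare the plans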

Another thing that won't be that fast, but may avoid the sort, is to select
the inner subquery above into a temporary table, analyze it, and then do the
outer select on that. Make sure your settings for temporary space
(temp_buffers in 8.3) are large enough for the intermediate results (700MB
should do it). That won't be that fast, but it will most likely be faster
than sorting 50+ million rows. There are lots of problems with this approach,
but it may be worth the experiment.
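
A rough sketch of that temp-table route, reusing the tables and filters from
the queries above (the temp table name is made up, and none of this is tested
here, so treat it as an outline only):

-- temp_buffers must be set before the first temp-table use in the session.
SET temp_buffers = '700MB';

-- 1. Materialize the filtered join into a temporary table.
create temporary table tmp_user_cat as
select a."user", b.category, b.cat_count
from result a, domain_categories b
where a."domain" = b."domain"
  and b.depth < 4
  and a.results > 100
  and a."user" < 30000;

-- 2. Give the planner statistics on the intermediate result.
analyze tmp_user_cat;

-- 3. Aggregate from the temp table; with good estimates this should be
--    able to hash aggregate instead of sorting.
select "user", category, sum(1.0/cat_count)::float
from tmp_user_cat
group by "user", category;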

On Tue, Aug 19, 2008 at 5:47 AM, Moritz Onken <onken(at)houseofdesign(dot)de> wrote:

>
> On 19.08.2008 at 14:17, Matthew Wakeling wrote:
>
>> On Tue, 19 Aug 2008, Moritz Onken wrote:
>>
>>> tablename | attname | n_distinct | correlation
>>> result    | domain  |       1642 |           1
>>>
>>
>> Well, the important thing is the correlation, which is 1, indicating that
>> Postgres knows that the table is clustered. So I have no idea why it is
>> sorting the entire table.
>>
>> What happens when you run EXPLAIN SELECT * FROM result ORDER BY domain?
>>
>>
> "Index Scan using result_domain_idx on result (cost=0.00..748720.72
> rows=20306816 width=49)"
> ... as it should be.
>
>>>> Sounds like an awfully long time to me. Also, I think restricting it to
>>>> 280 users is probably not making it any faster.
>>>>
>>>
>>> If I hadn't restricted it to 280 users, it would have run ~350 days...
>>>
>>
>> What makes you say that? Perhaps you could post EXPLAINs of both of the
>> queries.
>>
>> Matthew
>>
>
> That was just a guess. The query needs to retrieve the data for about
> 50,000 users. But it should be fast if I don't retrieve the data for
> specific users but let it run through all rows.
>
> explain insert into setup1 (select
> a."user", b.category, sum(1.0/b.cat_count)::float
> from result a, domain_categories b
> where a."domain" = b."domain"
> and b.depth < 4
> and a.results > 100
> and a."user" < 30000
> group by a."user", b.category);
>
>
> "GroupAggregate (cost=11745105.66..12277396.81 rows=28704 width=12)"
> " -> Sort (cost=11745105.66..11878034.93 rows=53171707 width=12)"
> " Sort Key: a."user", b.category"
> " -> Merge Join (cost=149241.25..1287278.89 rows=53171707
> width=12)"
> " Merge Cond: (b.domain = a.domain)"
> " -> Index Scan using domain_categories_domain on
> domain_categories b (cost=0.00..421716.32 rows=5112568 width=12)"
> " Filter: (depth < 4)"
> " -> Materialize (cost=148954.16..149446.36 rows=39376
> width=8)"
> " -> Sort (cost=148954.16..149052.60 rows=39376
> width=8)"
> " Sort Key: a.domain"
> " -> Bitmap Heap Scan on result a
> (cost=1249.93..145409.79 rows=39376 width=8)"
> " Recheck Cond: ("user" < 30000)"
> " Filter: (results > 100)"
> " -> Bitmap Index Scan on result_user_idx
> (cost=0.00..1240.08 rows=66881 width=0)"
> " Index Cond: ("user" < 30000)"
>
>
> This query limits the number of users to 215, and it took about 50
> minutes.
> I could create two temp tables which hold only those records which I need
> for this query. Would this be a good idea?
>
>
> moritz
>
