Re: Slow query with a lot of data

From: Moritz Onken <onken(at)houseofdesign(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Slow query with a lot of data
Date: 2008-08-19 09:03:34
Message-ID: 3663375B-7AC0-43F6-B6EC-D3EE572F3D34@houseofdesign.de


On 18.08.2008, at 18:05, Matthew Wakeling wrote:

> On Mon, 18 Aug 2008, Moritz Onken wrote:
>> Running the query for more than one user is indeed not much slower.
>> That's what I need. I'm clustering the results table on domain
>> right now. But why is this better than clustering it on "user"?
>
> The reason is the way the merge join algorithm works. What it does
> is take two tables and sort them both by the join fields. Then it
> can stream through both tables, producing results as it goes. It's
> the best join algorithm for inputs this large, but it does require
> both tables to be sorted by the same thing, which is "domain" in
> this case. The aggregation on "user" happens after the join has been
> done, and the hash aggregate can accept the users in random order.
>
> If you look at your last EXPLAIN, you can see that it has to sort
> the result table on domain, although it can read domain_categories
> in domain order thanks to the clustered index.
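
For reference, this is roughly how I set up the clustering (the index
name on result is from memory and may differ; domain_categories_domain
is the one that appears in the plan below):

CREATE INDEX result_domain_idx ON result ("domain");
CLUSTER result USING result_domain_idx;   -- 8.3 syntax; older releases
                                          -- use CLUSTER indexname ON tablename
ANALYZE result;

CLUSTER domain_categories USING domain_categories_domain;
ANALYZE domain_categories;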

explain select
a."user", b.category, sum(1.0/b.cat_count)::float
from result a, domain_categories b
where a."domain" = b."domain"
group by a."user", b.category;

"GroupAggregate (cost=21400443313.69..22050401897.13 rows=35049240
width=12)"
" -> Sort (cost=21400443313.69..21562757713.35 rows=64925759864
width=12)"
" Sort Key: a."user", b.category"
" -> Merge Join (cost=4000210.40..863834009.08
rows=64925759864 width=12)"
" Merge Cond: (b.domain = a.domain)"
" -> Index Scan using domain_categories_domain on
domain_categories b (cost=0.00..391453.79 rows=12105014 width=12)"
" -> Materialize (cost=3999931.73..4253766.93
rows=20306816 width=8)"
" -> Sort (cost=3999931.73..4050698.77
rows=20306816 width=8)"
" Sort Key: a.domain"
" -> Seq Scan on result a
(cost=0.00..424609.16 rows=20306816 width=8)"

Both result and domain_categories are clustered on domain and analyzed.
Running this query for 280 users (with "and "user" IN ([280 ids])"
appended) took 50 minutes; 78000 rows were returned and stored in a
table. Is that reasonable?
And why is it still sorting on domain? I thought clustering would keep
the planner from having to do that.
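
If it helps, I assume the planner's view of the physical order shows up
in pg_stats.correlation (a value near 1.0 meaning the heap really is in
domain order after the CLUSTER), so I could check it like this:

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename IN ('result', 'domain_categories')
  AND attname = 'domain';

-- As an experiment, penalizing explicit sorts should show whether the
-- planner could use an index scan on result instead:
SET enable_sort = off;
EXPLAIN select ... ;  -- the same query as above
RESET enable_sort;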

moritz
