Re: slow joining very large table to smaller ones

From: Dan Harris <fbsd@drivefaster.net>
To: pgsql-performance@postgresql.org
Subject: Re: slow joining very large table to smaller ones
Date: 2005-07-15 00:12:07
Message-ID: A8A7C7A0-7DC3-449C-A081-C1557DB9E16A@drivefaster.net
Lists: pgsql-performance


On Jul 14, 2005, at 5:12 PM, John A Meinel wrote:

> Dan Harris wrote:
>
>
>>>
>>> Well, postgres is estimating around 500 rows each, is that way
>>> off? Try
>>> just doing:
>>> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
>>> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;
>>>
>
> Once again, do this and post the results. We might just need to tweak
> your settings so that it estimates the number of rows correctly,
> and we
> don't need to do anything else.
>

Ok, sorry I missed these the first time through:

explain analyze select incidentid from k_b where id = 107;
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Index Scan using k_b_idx on k_b  (cost=0.00..1926.03 rows=675 width=14) (actual time=0.042..298.394 rows=2493 loops=1)
   Index Cond: (id = 107)
 Total runtime: 299.103 ms

select count(*) from k_b;
count
--------
698350

(sorry! I think I said this one only had tens of thousands in it)

explain analyze select incidentid from k_r where id = 94;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using k_r_idx on k_r  (cost=0.00..2137.61 rows=757 width=14) (actual time=0.092..212.187 rows=10893 loops=1)
   Index Cond: (id = 94)
 Total runtime: 216.498 ms
(3 rows)

select count(*) from k_r;
count
--------
671670

That one is quite a bit slower, yet it has the same table structure
and the same index as k_b, and it even has fewer records.

I did run VACUUM ANALYZE immediately before running these queries.
It seems a lot better with join_collapse_limit set.
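
Since the row estimates (675 and 757) are so far off from the actual
counts (2493 and 10893), I wonder if raising the per-column statistics
target on those id columns would help. Untested, but I believe the
incantation is something like this (100 is just a guess at a
bigger-than-default sample; the default target is 10):

-- raise the sample size ANALYZE uses for these columns
ALTER TABLE k_b ALTER COLUMN id SET STATISTICS 100;
ALTER TABLE k_r ALTER COLUMN id SET STATISTICS 100;
-- then re-analyze so the new target takes effect
ANALYZE k_b;
ANALYZE k_r;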

>
> Well, the planner is powerful enough to flatten nested selects. To
> make it less "intelligent" you can do:
> SET join_collapse_limit = 1;
> or
> SET join_collapse_limit = 0;
> Which should tell postgres to not try and get tricky with your query.
> Again, *usually* the planner knows better than you do. So again, just
> do it to see what you get.
>

Ok, when join_collapse_limit = 1 I get this now:

explain analyze select recordtext from eventactivity
join ( select incidentid from k_r join k_b using (incidentid)
       where k_r.id = 94 and k_b.id = 107 ) a
using (incidentid);

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..156509.08 rows=2948 width=35) (actual time=1.555..340.625 rows=24825 loops=1)
   ->  Nested Loop  (cost=0.00..5361.89 rows=6 width=28) (actual time=1.234..142.078 rows=366 loops=1)
         ->  Index Scan using k_b_idx on k_b  (cost=0.00..1943.09 rows=681 width=14) (actual time=0.423..56.974 rows=2521 loops=1)
               Index Cond: (id = 107)
         ->  Index Scan using k_r_idx on k_r  (cost=0.00..5.01 rows=1 width=14) (actual time=0.031..0.031 rows=0 loops=2521)
               Index Cond: ((k_r.id = 94) AND ((k_r.incidentid)::text = ("outer".incidentid)::text))
   ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..25079.55 rows=8932 width=49) (actual time=0.107..0.481 rows=68 loops=366)
         Index Cond: ((eventactivity.incidentid)::text = ("outer".incidentid)::text)
 Total runtime: 347.975 ms

MUCH better! Maybe you can help me understand what I did, and whether
I need to make this setting permanent to get this behavior from now on?
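
In case it helps anyone else following along, here's what I think the
options are for making it stick (untested beyond the per-session form,
and the role name below is just a placeholder):

-- per session (what I did above):
SET join_collapse_limit = 1;
-- per user, persists across sessions (placeholder role name):
ALTER USER some_user SET join_collapse_limit = 1;
-- or server-wide, by putting this line in postgresql.conf:
-- join_collapse_limit = 1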

>
>
>
> If you have analyzed recently can you do:
> SELECT relname, reltuples FROM pg_class WHERE relname='eventactivity';
>
> It is a cheaper form than "SELECT count(*) FROM eventactivity" to
> get an
> approximate estimate of the number of rows. But if it isn't too
> expensive, please also give the value from SELECT count(*) FROM
> eventactivity.
>
> Again, that helps us know if your tables are up-to-date.
>

Sure:

select relname, reltuples from pg_class where relname='eventactivity';
relname | reltuples
---------------+-------------
eventactivity | 3.16882e+07

select count(*) from eventactivity;
count
----------
31871142
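
(So the reltuples estimate is within about half a percent of the real
count, which I take to mean the stats on eventactivity are current.)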

>
>
>>
>>
>>
>>> I don't know how selective your keys are, but one of these queries
>>> should probably structure it better for the planner. It depends
>>> a lot on
>>> how selective your query is.
>>>
>>
>>
>> eventactivity currently has around 36 million rows in it. There
>> should
>> only be maybe 200-300 incidentids at most that will be matched
>> with the
>> combination of k_b and k_r. That's why I was thinking I could
>> somehow
>> get a list of just the incidentids that matched the id = 94 and id =
>> 107 in k_b and k_r first. Then, I would only need to grab a few
>> hundred
>> out of 36 million rows from eventactivity.
>>
>>
>
> Well, you can also try:
> SELECT count(*) FROM k_b JOIN k_r USING (incidentid)
> WHERE k_b.id=?? AND k_r.id=??
> ;
>
> That will tell you how many rows they have in common.

select count(*) from k_b join k_r using (incidentid) where k_b.id=107
and k_r.id=94;
count
-------
373

>
> Well, if you look at the latest plans, things have gone up from 44M
> to 156M. I don't know why it is worse, but it is getting there.

I assume this is because k_r and k_b are growing fairly rapidly right
now. A lot of inserts happened in the time between queries. I was
careful to run VACUUM ANALYZE before sending statistics, as I did this
time. I'm sorry if this has confused the issue.
