Re: slow joining very large table to smaller ones

From: Dan Harris <fbsd(at)drivefaster(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: slow joining very large table to smaller ones
Date: 2005-07-14 22:29:58
Message-ID: C3470621-9DB0-444A-BAD1-F9995E0142FB@drivefaster.net
Lists: pgsql-performance


On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:
>
>
> You might try giving it a little bit more freedom with:
>
> EXPLAIN ANALYZE
> SELECT recordtext FROM eventactivity, k_r, k_b
> WHERE eventactivity.incidentid = k_r.incidentid
> AND eventactivity.incidentid = k_b.incidentid
> AND k_r.id = 94
> AND k_b.id = 107
> -- AND k_r.incidentid = k_b.incidentid
> ;
>
> I'm pretty sure that would give identical results, just let the
> planner have a little bit more freedom about how it does it.
> Also the last line is commented out, because I think it is redundant.
>

Ok, I tried this one. My ssh session keeps getting cut off by a router
somewhere between me and the server due to inactivity timeouts, so all
I know is that both the SELECT and the EXPLAIN ANALYZE take over an
hour to run. Here's the plain EXPLAIN for that query, since that's the
best I can get.

explain select recordtext from eventactivity, k_r, k_b
where eventactivity.incidentid = k_r.incidentid
  and eventactivity.incidentid = k_b.incidentid
  and k_r.id = 94 and k_b.id = 107;

                                      QUERY PLAN
------------------------------------------------------------------------------------------------
 Merge Join  (cost=9624.61..4679590.52 rows=151009549 width=35)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
   ->  Merge Join  (cost=4766.92..4547684.26 rows=16072733 width=117)
         Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
         ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4186753.16 rows=46029271 width=49)
         ->  Sort  (cost=4766.92..4771.47 rows=1821 width=68)
               Sort Key: (k_b.incidentid)::text
               ->  Index Scan using k_b_idx on k_b  (cost=0.00..4668.31 rows=1821 width=68)
                     Index Cond: (id = 107)
   ->  Sort  (cost=4857.69..4862.39 rows=1879 width=68)
         Sort Key: (k_r.incidentid)::text
         ->  Index Scan using k_r_idx on k_r  (cost=0.00..4755.52 rows=1879 width=68)
               Index Cond: (id = 94)
(13 rows)

> You might also try:
> EXPLAIN ANALYZE
> SELECT recordtext
> FROM eventactivity JOIN k_r USING (incidentid)
> JOIN k_b USING (incidentid)
> WHERE k_r.id = 94
> AND k_b.id = 107
> ;
>

Similar results here. The query is taking at least an hour to finish.

explain select recordtext from eventactivity
join k_r using (incidentid)
join k_b using (incidentid)
where k_r.id = 94 and k_b.id = 107;

                                      QUERY PLAN
------------------------------------------------------------------------------------------------
 Merge Join  (cost=9542.77..4672831.12 rows=148391132 width=35)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
   ->  Merge Join  (cost=4726.61..4542825.87 rows=15930238 width=117)
         Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
         ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4184145.43 rows=46000104 width=49)
         ->  Sort  (cost=4726.61..4731.13 rows=1806 width=68)
               Sort Key: (k_b.incidentid)::text
               ->  Index Scan using k_b_idx on k_b  (cost=0.00..4628.92 rows=1806 width=68)
                     Index Cond: (id = 107)
   ->  Sort  (cost=4816.16..4820.82 rows=1863 width=68)
         Sort Key: (k_r.incidentid)::text
         ->  Index Scan using k_r_idx on k_r  (cost=0.00..4714.97 rows=1863 width=68)
               Index Cond: (id = 94)
(13 rows)

> Also, if possible give us the EXPLAIN ANALYZE so that we know if the
> planner is making accurate estimates. (You might send an EXPLAIN while
> waiting for the EXPLAIN ANALYZE to finish)
>
> You can also try disabling merge joins, and see how that changes
> things.
>

Are there any negative side effects of doing this?
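
(I'm assuming the way to try this is a session-level SET, which only
affects my current connection and doesn't persist or touch other
sessions:

set enable_mergejoin = off;
explain select ... ;  -- re-run the query to see what plan it picks now
set enable_mergejoin = on;

so the worst case should just be a slower plan for whatever I run
while it's off.)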

>
>> [...]
>
> Well, postgres is estimating around 500 rows each, is that way off?
> Try just doing:
> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;
>
> And see if postgres estimates the number of rows properly.
>
> I assume you have recently VACUUM ANALYZEd, which means you might need
> to update the statistics target (ALTER TABLE k_b ALTER COLUMN
> incidentid SET STATISTICS 100). The default is IIRC 10, the range is
> 1-1000; higher is more accurate, but makes ANALYZE slower.
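
(If I go that route, my understanding is the new target only takes
effect after the next ANALYZE of the table, so roughly:

alter table k_b alter column incidentid set statistics 100;
alter table k_r alter column incidentid set statistics 100;
analyze k_b;
analyze k_r;

doing both tables, since both join keys feed the estimates.)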
>
>
>>
>> Right now, it looks like pg is trying to sort the entire
>> eventactivity table for the merge join which is taking several
>> minutes to do. Can I rephrase this so that it does the searching
>> through k_r and k_b FIRST and then go into eventactivity using the
>> index on incidentid? It seems like that shouldn't be too hard to
>> make fast but my SQL query skills are only average.
>>
>
> To me, it looks like it is doing an index scan (on k_b.id) through k_b
> first, sorting the results by incidentid, then merge joining that with
> eventactivity.
>
> I'm guessing you actually want it to merge k_b and k_r to get extra
> selectivity before joining against eventactivity.
> I think my alternate forms would let postgres realize this. But if
> not, you could try:
>
> EXPLAIN ANALYZE
> SELECT recordtext FROM eventactivity
> JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
>       WHERE k_r.id = 94 AND k_b.id = 107) a
> USING (incidentid);
>

This one looks like the same plan as the others:

explain select recordtext from eventactivity
join (select incidentid from k_r join k_b using (incidentid)
      where k_r.id = 94 and k_b.id = 107) a
using (incidentid);

                                      QUERY PLAN
------------------------------------------------------------------------------------------------
 Merge Join  (cost=9793.33..4693149.15 rows=156544758 width=35)
   Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
   ->  Merge Join  (cost=4847.75..4557237.59 rows=16365843 width=117)
         Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
         ->  Index Scan using eventactivity1 on eventactivity  (cost=0.00..4191691.79 rows=46084161 width=49)
         ->  Sort  (cost=4847.75..4852.38 rows=1852 width=68)
               Sort Key: (k_b.incidentid)::text
               ->  Index Scan using k_b_idx on k_b  (cost=0.00..4747.24 rows=1852 width=68)
                     Index Cond: (id = 107)
   ->  Sort  (cost=4945.58..4950.36 rows=1913 width=68)
         Sort Key: (k_r.incidentid)::text
         ->  Index Scan using k_r_idx on k_r  (cost=0.00..4841.30 rows=1913 width=68)
               Index Cond: (id = 94)
(13 rows)

> I don't know how selective your keys are, but one of these queries
> should probably structure it better for the planner. It depends a lot
> on how selective your query is.

eventactivity currently has around 36 million rows in it. There
should only be maybe 200-300 incidentids at most that will be matched
with the combination of k_b and k_r. That's why I was thinking I
could somehow get a list of just the incidentids that matched the id
= 94 and id = 107 in k_b and k_r first. Then, I would only need to
grab a few hundred out of 36 million rows from eventactivity.
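
Something along these lines is what I had in mind (untested, just a
sketch of the shape I was picturing):

select recordtext from eventactivity
where incidentid in (
    select incidentid
    from k_r join k_b using (incidentid)
    where k_r.id = 94 and k_b.id = 107
);

i.e. resolve the few hundred matching incidentids first, then use the
incidentid index on eventactivity to pull just those rows instead of
scanning and sorting the whole table.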

> If you have 100M rows, the above query looks like it expects k_r to
> restrict it to 44M rows, and k_r + k_b down to 11M rows, which really
> should be a seq scan (> 10% of the rows = seq scan). But if you are
> saying the selectivity is mis-estimated it could be different.
>

Yeah, if I understand you correctly, I think the previous paragraph
shows this is a significant misestimate: the plans above expect on the
order of 150 million rows out of the final join, when only a few
hundred incidentids should actually match.
