Re: too complex query plan for not exists query and multicolumn indexes

From: Justin Graf <zzzzz(dot)graf(at)gmail(dot)com>
To: wakathane(at)gmail(dot)com, pgsql-performance(at)postgresql(dot)org
Subject: Re: too complex query plan for not exists query and multicolumn indexes
Date: 2010-03-19 17:02:05
Message-ID: 36f7b6b51003191002h5c6ad9cfhe0a2e7e7a328d5ae@mail.gmail.com
Lists: pgsql-performance

---- Message from Corin <wakathane(at)gmail(dot)com> at 03-19-2010 01:26:35 PM ----

***snip ****
The intention of the query is to find rows with no "partner" row. The
offset and limit are just to ignore the time needed to send the result
to the client.
-------
I don't understand the point of the OFFSET; LIMIT will accomplish the same
thing. PG will still execute the query; the only difference is that PG will
skip the step of counting through the first million rows before returning a
record.
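
If the goal is just to time the query without the cost of shipping a result
set to the client, a simpler option (a minimal sketch, using nothing beyond
the query already quoted above) is to run it under EXPLAIN ANALYZE, which
executes the query and reports per-step timings but does not return the rows:

EXPLAIN ANALYZE
SELECT * FROM friends AS f1
WHERE NOT EXISTS (
    SELECT 1 FROM friends AS f2
    WHERE f1.user_id = f2.ref_id
      AND f1.ref_id  = f2.user_id
);   -- no OFFSET/LIMIT needed just to exclude client transfer time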

-------

SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS
f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id) OFFSET 1000000
LIMIT 1

MySQL uses this query plan:

id  select_type         table  type   possible_keys  key       key_len  ref                   rows     Extra
1   PRIMARY             f1     index  NULL           user_ref  8        NULL                  2818860  Using where; Using index
2   DEPENDENT SUBQUERY  f2     ref    user_ref       user_ref  8        f1.ref_id,f1.user_id  1        Using index

Time: 9.8s

-------
If that's a query EXPLAIN from MySQL, it's worthless. The above has no timing
information; it does not tell us how long each step actually took, let alone
show what the planner estimated the query would cost.
------

PostgreSQL uses this query plan:
"Limit (cost=66681.50..66681.50 rows=1 width=139) (actual
time=7413.489..7413.489 rows=1 loops=1)"
" -> Merge Anti Join (cost=40520.17..66681.50 rows=367793 width=139)
(actual time=3705.078..7344.256 rows=1000001 loops=1)"
" *Merge Cond: ((f1.user_id = f2.ref_id) AND (f1.ref_id =
f2.user_id))"*
" -> Index Scan using user_ref on friends f1
(cost=0.00..26097.86 rows=2818347 width=139) (actual
time=0.093..1222.592 rows=1917360 loops=1)"
" -> Materialize (cost=40520.17..40555.40 rows=2818347 width=8)
(actual time=3704.977..5043.347 rows=1990148 loops=1)"
" -> Sort (cost=40520.17..40527.21 rows=2818347 width=8)
(actual time=3704.970..4710.703 rows=1990148 loops=1)"
" Sort Key: f2.ref_id, f2.user_id"
" Sort Method: external merge Disk: 49576kB"
" -> Seq Scan on friends f2 (cost=0.00..18143.18
rows=2818347 width=8) (actual time=0.015..508.797 rows=2818347 loops=1)"
"Total runtime: 7422.516 ms"

---
We can see each step PG takes and make informed decisions about which part of
the query is slow. We can see that sorting the rows takes most of the time.
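
The plan above reports "Sort Method: external merge  Disk: 49576kB", i.e. the
sort spilled to disk. A possible tuning sketch (the 64MB figure is only an
assumption; pick it based on the RAM actually available and the number of
concurrent sessions) is to raise work_mem for the session and re-run the
EXPLAIN ANALYZE above:

SET work_mem = '64MB';   -- session-level setting; if the sort then fits in
                         -- memory it avoids the external merge on disk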
---

It's already faster, which is great, but I wonder why the query plan is
that complex.

----
It's not complex; it's showing you all the steps, which MySQL is not showing you.
----

I read in the pgsql docs that using a multicolumn key is almost never
needed and only a waste of cpu/space. So I dropped the multicolumn key
and added two separate keys instead:

----
Where is that at? I don't recall reading that. PG will only use indexes
that exactly match the WHERE/JOIN conditions.
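
For what it's worth, the multicolumn index the first plan used (user_ref)
presumably looked something like the sketch below; the column order is an
assumption, since its definition is not shown in the thread, but having both
join columns in one index is what let the planner put both conditions into
the Merge Cond of the first plan:

CREATE INDEX user_ref ON friends USING btree (user_id, ref_id);  -- assumed definition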

----

CREATE INDEX ref1 ON friends USING btree (ref_id);
CREATE INDEX user1 ON friends USING btree (user_id);

New query plan:
"Limit (cost=70345.04..70345.04 rows=1 width=139) (actual
time=43541.709..43541.709 rows=1 loops=1)"
" -> Merge Anti Join (cost=40520.27..70345.04 rows=367793 width=139)
(actual time=3356.694..43467.818 rows=1000001 loops=1)"
" * Merge Cond: (f1.user_id = f2.ref_id)"
" Join Filter: (f1.ref_id = f2.user_id)"
---
Take note: the merge condition has changed. It now joins only on
f1.user_id = f2.ref_id and then filters the results down using the AND
condition. Put the index back.
---
*" -> Index Scan using user1 on friends f1 (cost=0.00..26059.79
rows=2818347 width=139) (actual time=0.031..1246.668 rows=1917365 loops=1)"
" -> Materialize (cost=40520.17..40555.40 rows=2818347 width=8)
(actual time=3356.615..14941.405* rows=130503729* loops=1)"
---
Take note of what happened here: the Materialize node now returns
130,503,729 rows, because the join is no longer limited by both columns as
it was before. Did you run this query against MySQL with the same kind of
indexes?
-----
" -> Sort (cost=40520.17..40527.21 rows=2818347 width=8)
(actual time=3356.611..4127.435 rows=1990160 loops=1)"
" Sort Key: f2.ref_id"
" Sort Method: external merge Disk: 49560kB"
" -> Seq Scan on friends f2 (cost=0.00..18143.18
rows=2818347 width=8) (actual time=0.012..496.174 rows=2818347 loops=1)"
"Total runtime: 43550.187 ms"

I also wonder why it makes a difference when adding a "LIMIT" clause to
the subselect in an EXISTS subselect. Shouldn't pgsql always stop after
finding a row? In MySQL it makes no difference in speed; pgsql even
gets slower when adding a LIMIT to the EXISTS subselect (I hoped it
would get faster?!).

----
Limits are applied last, after all the major work is done.
----

SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS
f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id LIMIT 1) OFFSET
1000000 LIMIT 1

"Limit (cost=6389166.19..6389172.58 rows=1 width=139) (actual
time=54540.356..54540.356 rows=1 loops=1)"
" -> Seq Scan on friends f1 (cost=0.00..9003446.87 rows=1409174
width=139) (actual time=0.511..54460.006 rows=1000001 loops=1)"
" Filter: (NOT (SubPlan 1))"
" * SubPlan 1"
" -> Limit (cost=2.18..3.19 rows=1 width=0) (actual
time=0.029..0.029 rows=0 loops=1832284)"*
---
This caused PG to create a subplan, which resulted in a more complex plan;
PG is doing lots more work.
---
" -> Bitmap Heap Scan on friends f2 (cost=2.18..3.19
rows=1 width=0) (actual time=0.028..0.028 rows=0 loops=1832284)"
" Recheck Cond: (($0 = ref_id) AND ($1 = user_id))"
" -> BitmapAnd (cost=2.18..2.18 rows=1 width=0)
(actual time=0.027..0.027 rows=0 loops=1832284)"
" -> Bitmap Index Scan on ref1
(cost=0.00..1.09 rows=75 width=0) (actual time=0.011..0.011 rows=85
loops=1832284)"
" Index Cond: ($0 = ref_id)"
" -> Bitmap Index Scan on user1
(cost=0.00..1.09 rows=87 width=0) (actual time=0.011..0.011 rows=87
loops=1737236)"
" Index Cond: ($1 = user_id)"
"Total runtime: 54540.431 ms"

---
This plan is not even close to the others.
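
A sketch of what I would try instead (assuming the multicolumn index is
recreated as sketched earlier): drop the LIMIT from the subselect entirely.
EXISTS/NOT EXISTS already stops at the first matching row, so the LIMIT buys
nothing, and as the plans in this thread show, it is what kept the planner
from using the merge anti join and forced the SubPlan to run once per outer
row (loops=1832284 above):

SELECT * FROM friends AS f1
WHERE NOT EXISTS (
    SELECT 1 FROM friends AS f2
    WHERE f1.user_id = f2.ref_id
      AND f1.ref_id  = f2.user_id   -- no LIMIT inside the subselect
);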
---
