
too complex query plan for not exists query and multicolumn indexes

From: Corin <wakathane(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: too complex query plan for not exists query and multicolumn indexes
Date: 2010-03-19 12:26:35
Message-ID: 4BA36D7B.6010506@gmail.com
Lists: pgsql-performance
Hi all!

While evaluating the pgsql query planner I found some weird behavior.
I think its plan is way too complex and could be much faster.

CREATE TABLE friends (
    id integer NOT NULL,
    user_id integer NOT NULL,
    ref_id integer NOT NULL
);

ALTER TABLE ONLY friends ADD CONSTRAINT friends_pkey PRIMARY KEY (id);
CREATE INDEX user_ref ON friends USING btree (user_id, ref_id);

I filled this table with around 2,800,000 random rows (user_id and 
ref_id values between 1 and 500,000).
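
For reference, a data set of that shape can be generated with something like the following sketch; my actual test data was random too, so the precise values will differ:

```sql
-- Hypothetical generator: ~2,800,000 rows with user_id/ref_id drawn
-- uniformly from 1..500,000.
INSERT INTO friends (id, user_id, ref_id)
SELECT g,
       1 + floor(random() * 500000)::int,
       1 + floor(random() * 500000)::int
FROM generate_series(1, 2800000) AS g;
ANALYZE friends;
```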

The intention of the query is to find rows with no "partner" row. The 
offset and limit are just to ignore the time needed to send the result 
to the client.

SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS 
f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id) OFFSET 1000000 
LIMIT 1

MySQL uses this query plan:

1  PRIMARY             f1  index  NULL      user_ref  8  NULL                  2818860  Using where; Using index
2  DEPENDENT SUBQUERY  f2  ref    user_ref  user_ref  8  f1.ref_id,f1.user_id  1        Using index

Time: 9.8s

PostgreSQL uses this query plan:

Limit  (cost=66681.50..66681.50 rows=1 width=139) (actual time=7413.489..7413.489 rows=1 loops=1)
  ->  Merge Anti Join  (cost=40520.17..66681.50 rows=367793 width=139) (actual time=3705.078..7344.256 rows=1000001 loops=1)
        Merge Cond: ((f1.user_id = f2.ref_id) AND (f1.ref_id = f2.user_id))
        ->  Index Scan using user_ref on friends f1  (cost=0.00..26097.86 rows=2818347 width=139) (actual time=0.093..1222.592 rows=1917360 loops=1)
        ->  Materialize  (cost=40520.17..40555.40 rows=2818347 width=8) (actual time=3704.977..5043.347 rows=1990148 loops=1)
              ->  Sort  (cost=40520.17..40527.21 rows=2818347 width=8) (actual time=3704.970..4710.703 rows=1990148 loops=1)
                    Sort Key: f2.ref_id, f2.user_id
                    Sort Method:  external merge  Disk: 49576kB
                    ->  Seq Scan on friends f2  (cost=0.00..18143.18 rows=2818347 width=8) (actual time=0.015..508.797 rows=2818347 loops=1)
Total runtime: 7422.516 ms

It's already faster, which is great, but I wonder why the query plan is 
that complex.
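
One detail that may matter here: the Sort node spills about 48 MB to disk ("external merge  Disk: 49576kB"). If work_mem is still at its small default, raising it for the session should at least keep the sort in memory; that is a guess on my part, not something I have benchmarked:

```sql
-- Assumption: work_mem is still at the default (1MB on recent versions).
-- Giving the sort enough memory avoids the external merge on disk.
SET work_mem = '64MB';

EXPLAIN ANALYZE
SELECT * FROM friends AS f1
WHERE NOT EXISTS (SELECT 1 FROM friends AS f2
                  WHERE f1.user_id = f2.ref_id AND f1.ref_id = f2.user_id)
OFFSET 1000000 LIMIT 1;
```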

I read in the PostgreSQL docs that a multicolumn key is almost never 
needed and only a waste of cpu/space. So I dropped the multicolumn key 
and added two separate keys instead:

CREATE INDEX ref1 ON friends USING btree (ref_id);
CREATE INDEX user1 ON friends USING btree (user_id);

New query plan:
Limit  (cost=70345.04..70345.04 rows=1 width=139) (actual time=43541.709..43541.709 rows=1 loops=1)
  ->  Merge Anti Join  (cost=40520.27..70345.04 rows=367793 width=139) (actual time=3356.694..43467.818 rows=1000001 loops=1)
        Merge Cond: (f1.user_id = f2.ref_id)
        Join Filter: (f1.ref_id = f2.user_id)
        ->  Index Scan using user1 on friends f1  (cost=0.00..26059.79 rows=2818347 width=139) (actual time=0.031..1246.668 rows=1917365 loops=1)
        ->  Materialize  (cost=40520.17..40555.40 rows=2818347 width=8) (actual time=3356.615..14941.405 rows=130503729 loops=1)
              ->  Sort  (cost=40520.17..40527.21 rows=2818347 width=8) (actual time=3356.611..4127.435 rows=1990160 loops=1)
                    Sort Key: f2.ref_id
                    Sort Method:  external merge  Disk: 49560kB
                    ->  Seq Scan on friends f2  (cost=0.00..18143.18 rows=2818347 width=8) (actual time=0.012..496.174 rows=2818347 loops=1)
Total runtime: 43550.187 ms

As one can see, it's much, much slower and only uses one of the two 
indexes. I thought performance would be almost equal.

I also wonder why it makes a difference to add a "LIMIT" clause to the 
subselect of an EXISTS. Shouldn't pgsql always stop after finding the 
first row anyway? In MySQL it makes no difference in speed; pgsql even 
gets slower when a LIMIT is added to the EXISTS subselect (I had hoped 
it would get faster?!).

SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS 
f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id LIMIT 1) OFFSET 
1000000 LIMIT 1

Limit  (cost=6389166.19..6389172.58 rows=1 width=139) (actual time=54540.356..54540.356 rows=1 loops=1)
  ->  Seq Scan on friends f1  (cost=0.00..9003446.87 rows=1409174 width=139) (actual time=0.511..54460.006 rows=1000001 loops=1)
        Filter: (NOT (SubPlan 1))
        SubPlan 1
          ->  Limit  (cost=2.18..3.19 rows=1 width=0) (actual time=0.029..0.029 rows=0 loops=1832284)
                ->  Bitmap Heap Scan on friends f2  (cost=2.18..3.19 rows=1 width=0) (actual time=0.028..0.028 rows=0 loops=1832284)
                      Recheck Cond: (($0 = ref_id) AND ($1 = user_id))
                      ->  BitmapAnd  (cost=2.18..2.18 rows=1 width=0) (actual time=0.027..0.027 rows=0 loops=1832284)
                            ->  Bitmap Index Scan on ref1  (cost=0.00..1.09 rows=75 width=0) (actual time=0.011..0.011 rows=85 loops=1832284)
                                  Index Cond: ($0 = ref_id)
                            ->  Bitmap Index Scan on user1  (cost=0.00..1.09 rows=87 width=0) (actual time=0.011..0.011 rows=87 loops=1737236)
                                  Index Cond: ($1 = user_id)
Total runtime: 54540.431 ms
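
For comparison, the same anti-join can also be spelled as a LEFT JOIN ... IS NULL, which (as far as I understand) the planner can turn into an Anti Join just like the plain NOT EXISTS, unlike the EXISTS form with the LIMIT inside:

```sql
-- Equivalent formulation without a subquery; presumably this yields
-- the same Merge Anti Join plan as the NOT EXISTS version above.
SELECT f1.*
FROM friends AS f1
LEFT JOIN friends AS f2
  ON f1.user_id = f2.ref_id AND f1.ref_id = f2.user_id
WHERE f2.id IS NULL
OFFSET 1000000 LIMIT 1;
```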

As in my previous tests, this is only a testing environment: so all data 
is in memory, no disk activity involved at all, no swap etc.

Thanks,
Corin

BTW: I'll respond to the answers of my previous post later.

