Strange number of rows in plan cost

From: Алексей Кузнецов <alex_kuznetcov(at)mail(dot)ru>
To: pgsql-performance(at)postgresql(dot)org
Subject: Strange number of rows in plan cost
Date: 2013-12-23 06:32:46
Message-ID: 1387780366.910542228@f394.i.mail.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

I already posted this question to novice mail list and there is no answer yet. I've decided to post it again here.

Before posting the question here, I checked the mail list again for the same cases and found the message describing the case I started from: http://www.postgresql.org/message-id/51304B61.5030103@gmail.com
It looks like there is no answer for that case too.

I have a performance issue on using UNION ALL clause. Optimizer generates incorrect plan based on strange estimation of returned rows number. It suppose  that plan will be correct if this estimation is done correctly.
The following example helps to reproduce the issue:

CREATE TABLE t1 (c1 INTEGER, id INTEGER PRIMARY KEY);
INSERT INTO t1 (c1, id) SELECT b, b FROM generate_series(1, 1000000) a (b);
REINDEX TABLE t1;
ANALYZE t1;

CREATE TABLE t2 (c1 INTEGER, id INTEGER PRIMARY KEY);
INSERT INTO t2 (c1, id) SELECT b, b FROM generate_series(1, 1000000) a (b);
REINDEX TABLE t2;
ANALYZE t2;

CREATE TABLE links (c1 INTEGER PRIMARY KEY, descr TEXT);
INSERT INTO links (c1, descr) SELECT b, '2' FROM generate_series(1, 100000) a (b);
REINDEX TABLE links;
ANALYZE links;

CREATE TEMP TABLE t3 (ref_id INTEGER);
INSERT INTO t3 (ref_id) VALUES (333333), (666666);
ANALYZE t3;

If I do the following:
EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM t1) t INNER JOIN t3 ON t3.ref_id = t.id INNER JOIN links l ON (t.c1 = l.c1);

QUERY PLAN:
Nested Loop  (cost=0.00..18.39 rows=1 width=18) (actual time=0.056..0.056 rows=0 loops=1)
  ->  Nested Loop  (cost=0.00..17.80 rows=2 width=12) (actual time=0.030..0.047 rows=2 loops=1)
        ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4) (actual time=0.007..0.008 rows=2 loops=1)
        ->  Index Scan using t1_pkey on t1  (cost=0.00..8.38 rows=1 width=8) (actual time=0.015..0.016 rows=1 loops=2)
              Index Cond: (id = t3.ref_id)
  ->  Index Scan using links_pkey on links l  (cost=0.00..0.28 rows=1 width=6) (actual time=0.004..0.004 rows=0 loops=2)
        Index Cond: (c1 = t1.c1)
"Total runtime: 0.118 ms"

It uses correctly index scan on "links" table and works normal.

If I do the following:
EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t INNER JOIN t3 ON t3.ref_id = t.id INNER JOIN links l ON (t.c1 = l.c1);

QUERY PLAN:
Hash Join  (cost=2693.00..3127.58 rows=20000 width=18) (actual time=47.158..47.158 rows=0 loops=1)
  Hash Cond: (t1.c1 = l.c1)
  ->  Nested Loop  (cost=0.00..34.58 rows=20000 width=12) (actual time=0.049..0.101 rows=4 loops=1)
        ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4) (actual time=0.010..0.011 rows=2 loops=1)
        ->  Append  (cost=0.00..16.76 rows=2 width=8) (actual time=0.022..0.038 rows=2 loops=2)
              ->  Index Scan using t1_pkey on t1  (cost=0.00..8.38 rows=1 width=8) (actual time=0.019..0.022 rows=1 loops=2)
                    Index Cond: (id = t3.ref_id)
              ->  Index Scan using t2_pkey on t2  (cost=0.00..8.38 rows=1 width=8) (actual time=0.011..0.012 rows=1 loops=2)
                    Index Cond: (id = t3.ref_id)
  ->  Hash  (cost=1443.00..1443.00 rows=100000 width=6) (actual time=46.988..46.988 rows=100000 loops=1)
        Buckets: 16384  Batches: 1  Memory Usage: 3711kB
        ->  Seq Scan on links l  (cost=0.00..1443.00 rows=100000 width=6) (actual time=0.015..17.443 rows=100000 loops=1)
"Total runtime: 47.246 ms"

It uses sequence scan on "links" table because of strange estimation on selection with using UNION ALL. It is very slow. I have more that million rows in each tables.

Strange estimation is shown bellow:
EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM t1 UNION ALL SELECT * FROM t2) t INNER JOIN t3 ON t3.ref_id = t.id;

QUERY PLAN:
Nested Loop  (cost=0.00..34.58 rows=20000 width=12) (actual time=0.049..0.099 rows=4 loops=1)
  ->  Seq Scan on t3  (cost=0.00..1.02 rows=2 width=4) (actual time=0.009..0.010 rows=2 loops=1)
  ->  Append  (cost=0.00..16.76 rows=2 width=8) (actual time=0.023..0.038 rows=2 loops=2)
        ->  Index Scan using t1_pkey on t1  (cost=0.00..8.38 rows=1 width=8) (actual time=0.021..0.022 rows=1 loops=2)
              Index Cond: (id = t3.ref_id)
        ->  Index Scan using t2_pkey on t2  (cost=0.00..8.38 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=2)
              Index Cond: (id = t3.ref_id)
"Total runtime: 0.164 ms"

The strange estimation is seen at the first line:
Nested Loop  (cost=0.00..34.58 rows=20000 width=12) (actual time=0.049..0.099 rows=4 loops=1)
At the second line we can see that only two rows will be selected from joined table and 4 rows from two tables appended with UNION ALL, how it estimates that it should handle 20000 rows for that query?
I suppose, that if it estimates 4 rows will be handled then it will use index scan on "links" table and will work fast.

With best regards, Aleksey Kuznetsov.

Browse pgsql-performance by date

  From Date Subject
Next Message Johann Spies 2013-12-23 08:58:59 Re: query not using index
Previous Message David Rowley 2013-12-21 05:46:22 Re: DATE_TRUNC() and GROUP BY?