Re: Parameterized-path cost comparisons need some work

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Parameterized-path cost comparisons need some work
Date: 2012-02-29 09:28:04
Message-ID: 003601ccf6c4$7117a660$5346f320$%kapila@huawei.com

>The most obvious thing to do about this is to consider that one path can
>dominate another on cost only if it is both cheaper *and* produces no
>more rows. But I'm concerned about the cost of inserting yet another
>test condition into add_path, which is slow enough already. Has anyone
>got an idea for a different formulation that would avoid that?
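A minimal standalone sketch of the two comparison rules being discussed, with a toy path struct standing in for the planner's real Path and add_path code; the indexscan cost at random_page_cost = 6 is a made-up figure, not taken from the EXPLAIN output:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for a planner path; not PostgreSQL's real Path struct. */
typedef struct ToyPath
{
    double      total_cost;     /* estimated total cost */
    double      rows;           /* estimated output row count */
} ToyPath;

/* Current behaviour in this scenario: cheaper alone is enough to discard. */
static bool
dominates_on_cost(const ToyPath *a, const ToyPath *b)
{
    return a->total_cost <= b->total_cost;
}

/* Proposed rule: cheaper *and* produces no more rows. */
static bool
dominates_on_cost_and_rows(const ToyPath *a, const ToyPath *b)
{
    return a->total_cost <= b->total_cost && a->rows <= b->rows;
}

int
main(void)
{
    /* Row counts from the EXPLAIN output below; the indexscan cost here
     * is an assumed value for random_page_cost = 6. */
    ToyPath seqscan = {458.00, 10000.0};    /* unparameterized, returns every row */
    ToyPath idxscan = {520.00, 100.0};      /* parameterized on t1.hundred */

    printf("cost only:     seqscan discards indexscan? %s\n",
           dominates_on_cost(&seqscan, &idxscan) ? "yes" : "no");
    printf("cost and rows: seqscan discards indexscan? %s\n",
           dominates_on_cost_and_rows(&seqscan, &idxscan) ? "yes" : "no");
    return 0;
}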

Will it discard the seqscan path when the number of rows is higher but the
cost is lower, or will it keep both paths and only later, based on the cost
estimates for the join, discard one of them?
Could it work so that, when the seqscan has the lower cost and its row count
exceeds the other path's only by a certain threshold, the seqscan is kept;
otherwise the indexscan dominates?
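
A rough sketch of that threshold idea, again using a toy path struct rather than the real add_path code; the 25% threshold and the path costs are placeholders only:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for a planner path; not PostgreSQL's real Path struct. */
typedef struct ToyPath
{
    double      total_cost;
    double      rows;
} ToyPath;

/*
 * Keep the cheaper path even though it produces more rows, but only while
 * its row count stays within (1 + threshold) times the narrower path's;
 * beyond that, let the narrower (parameterized) path dominate.
 */
static bool
cheaper_path_survives(const ToyPath *cheaper, const ToyPath *narrower,
                      double threshold)
{
    return cheaper->rows <= narrower->rows * (1.0 + threshold);
}

int
main(void)
{
    ToyPath seqscan = {458.00, 10000.0};    /* cheap, but returns every row */
    ToyPath idxscan = {520.00, 100.0};      /* parameterized, few rows (cost assumed) */

    printf("keep the cheaper seqscan? %s\n",
           cheaper_path_survives(&seqscan, &idxscan, 0.25) ? "yes" : "no");
    return 0;
}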

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Tom Lane
Sent: Wednesday, February 29, 2012 4:05 AM
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: [HACKERS] Parameterized-path cost comparisons need some work

I was experimenting today with a test case sent to me by somebody at Red
Hat. The details aren't too important, except that it involves an inner
join on the inside of a left join (where it cannot commute with the left
join). I can reproduce the behavior with a standard regression test
table, if I crank random_page_cost up a bit:

regression=# set random_page_cost TO 5;
SET
regression=# explain select * from tenk1 t1 left join (tenk1 t2 join tenk1 t3 on t2.thousand = t3.unique2) on t1.hundred = t2.hundred where t1.unique1 = 1;
                                            QUERY PLAN
------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..507.16 rows=100 width=732)
   ->  Index Scan using tenk1_unique1 on tenk1 t1  (cost=0.00..10.28 rows=1 width=244)
         Index Cond: (unique1 = 1)
   ->  Nested Loop  (cost=0.00..495.63 rows=100 width=488)
         ->  Index Scan using tenk1_hundred on tenk1 t2  (cost=0.00..446.98 rows=100 width=244)
               Index Cond: (t1.hundred = hundred)
         ->  Index Scan using tenk1_unique2 on tenk1 t3  (cost=0.00..0.48 rows=1 width=244)
               Index Cond: (unique2 = t2.thousand)
(8 rows)

regression=# set random_page_cost TO 6;
SET
regression=# explain select * from tenk1 t1 left join (tenk1 t2 join tenk1 t3 on t2.thousand = t3.unique2) on t1.hundred = t2.hundred where t1.unique1 = 1;
                                         QUERY PLAN
-----------------------------------------------------------------------------------------
 Hash Right Join  (cost=928.30..2573.80 rows=100 width=732)
   Hash Cond: (t2.hundred = t1.hundred)
   ->  Hash Join  (cost=916.00..2523.00 rows=10000 width=488)
         Hash Cond: (t2.thousand = t3.unique2)
         ->  Seq Scan on tenk1 t2  (cost=0.00..458.00 rows=10000 width=244)
         ->  Hash  (cost=458.00..458.00 rows=10000 width=244)
               ->  Seq Scan on tenk1 t3  (cost=0.00..458.00 rows=10000 width=244)
   ->  Hash  (cost=12.29..12.29 rows=1 width=244)
         ->  Index Scan using tenk1_unique1 on tenk1 t1  (cost=0.00..12.29 rows=1 width=244)
               Index Cond: (unique1 = 1)
(10 rows)

WTF? How did it suddenly fail to find the double-nested-loop plan, and
have to settle for a much worse plan?

The reason seems to be that for large enough random_page_cost, the
seqscan on t2 is costed as cheaper than an indexscan that selects a
significant fraction of the table. (Notice the t2 scans are nearly the
same cost in these two examples.) That causes add_path to decide that
the unparameterized seqscan is unconditionally better than the
parameterized indexscan, and throw out the latter. Now it is impossible
to form the parameterized nestloop subplan.

The flaw in this logic, of course, is that the seqscan might be cheaper
than the parameterized indexscan, but *it produces a whole lot more
rows*, meaning that any subsequent join will be a lot more expensive.
Previously add_path didn't have to worry about that, because all
ordinary paths for a given relation produce the same number of rows
(and we studiously kept things like inner indexscan paths out of
add_path's view of the world).

The most obvious thing to do about this is to consider that one path can
dominate another on cost only if it is both cheaper *and* produces no
more rows. But I'm concerned about the cost of inserting yet another
test condition into add_path, which is slow enough already. Has anyone
got an idea for a different formulation that would avoid that?

regards, tom lane
