Cost estimates for parameterized paths

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Cost estimates for parameterized paths
Date: 2010-09-02 21:31:12
Message-ID: 14624.1283463072@sss.pgh.pa.us
Lists: pgsql-hackers

A while back I ranted about replacing the planner's concept of inner
indexscans with a more generalized notion of "parameterized paths":
http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php

The executor fixes for that are done, and now I'm grappling with getting
the planner to do something useful with it. The biggest problem I've run
into is that a parameterized path can't really be assigned a fixed cost
in the same way that a normal path can. The current implementation of
cost_index() depends on knowing the size of the outer relation --- that
is, the expected number of execution loops for the indexscan --- in order
to account for cache effects sanely while estimating the average cost of
any one inner indexscan. We know this is an important thing to do
because the cost estimates are a lot closer to reality now that we
account for it than they were before; so dropping the consideration is
entirely out of the question.
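To see why the loop count matters so much, here is a toy sketch (invented for illustration; the real cost_index() uses the Mackert-Lohman formula and several other inputs) of how the average per-scan cost of a repeated inner indexscan shrinks as the number of loops grows:

```python
def avg_indexscan_cost(pages_fetched, loop_count,
                       random_page_cost=4.0, cached_page_cost=0.01):
    """Toy model: the first execution pays full random I/O for every
    page; later executions mostly find the same index and heap pages
    already in cache, so they are nearly free by comparison."""
    first_scan = pages_fetched * random_page_cost
    repeat_scan = pages_fetched * cached_page_cost
    total = first_scan + (loop_count - 1) * repeat_scan
    return total / loop_count

# The same physical path looks radically different depending on how
# many outer rows drive it: repeated 1000 times its estimated per-scan
# cost is a small fraction of its cost when executed once.
one_loop = avg_indexscan_cost(100, 1)
many_loops = avg_indexscan_cost(100, 1000)
```

In this toy model the per-scan estimate drops from 400 cost units at one loop to under 2 at a thousand loops, which is exactly why a single fixed cost can't honestly describe a parameterized path.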

The planner is already cheating on this to a considerable extent, because
it estimates the cost of an inner indexscan only once, using the first
outer rel we try to join to. That cost is cached and reused with other
potential outer-rel join partners, even though very different numbers of
outer rows might be involved. This problem will get a lot worse with the
types of plans that I hope the planner will be able to come up with after
this fix goes in, because the indexscan might be at the bottom of a join
nest. So we need a real fix, not another hack.

The best idea I can come up with at the moment is to compute "best case"
and "worst case" costs for a parameterized path, corresponding to the
largest and smallest numbers of loops we might expect it to be repeated
for. The largest number of loops could be estimated via the Cartesian
product of the sizes of the other tables in the query, for example; the
"worst case" cost is the path's cost if it is executed only once. Then,
during path
pruning in add_path, we only discard a parameterized path if its best-case
cost is worse than the worst-case cost of some otherwise comparable path.
Whenever we join the parameterized path with the required outer relation,
we redo the cost calculation using that rel's actual rowcount estimate in
order to form a final cost estimate for the no-longer-parameterized join
path.
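In miniature, the proposed pruning rule might look like the sketch below (names and structure are invented for illustration; the real add_path also compares pathkeys, startup cost, and so on). A path carries a [best-case, worst-case] cost interval, and a candidate is discarded only when some rival's worst case beats even the candidate's best case:

```python
def dominates(rival_worst, candidate_best):
    """A rival certainly wins only if its worst-case cost still beats
    the candidate's best-case cost.  Strict comparison, so two paths
    can never prune each other."""
    return rival_worst < candidate_best

def prune(paths):
    """Keep every path that is not certainly beaten by some other
    path.  Each path is a dict with 'best' and 'worst' cost bounds;
    an unparameterized path simply has best == worst."""
    kept = []
    for cand in paths:
        beaten = any(dominates(other["worst"], cand["best"])
                     for other in paths if other is not cand)
        if not beaten:
            kept.append(cand)
    return kept

paths = [
    # Parameterized indexscan: cheap if repeated many times, expensive
    # if executed once -- a wide interval, so it survives pruning.
    {"name": "param_idx", "best": 1.4, "worst": 400.0},
    # Unparameterized seqscan: a fixed cost.
    {"name": "seqscan", "best": 50.0, "worst": 50.0},
    # A path whose best case already loses to the seqscan's worst case.
    {"name": "loser", "best": 500.0, "worst": 500.0},
]
survivors = {p["name"] for p in prune(paths)}
```

The worry in the next paragraph falls straight out of this sketch: wide cost intervals overlap with almost everything, so few parameterized paths get discarded.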

While this looks like it would work in principle, I'm concerned that it
would be unable to prune very many parameterized paths, and thus that
planning time might run unacceptably long. The repeated cost calculations
aren't much fun either, although we could probably cache most of that work
if we're willing to throw memory at the problem.

I wonder if anyone's got a better idea ...

regards, tom lane
