Re: Cost estimates for parameterized paths

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cost estimates for parameterized paths
Date: 2010-09-03 11:51:18
Message-ID: AANLkTi=PUeVs5==SZmcLOYo_vGi+QgZPFjGxLiHdX3Gq@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 2, 2010 at 5:31 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Awhile back I ranted about replacing the planner's concept of inner
> indexscans with a more generalized notion of "parameterized paths":
> http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php
>
> The executor fixes for that are done, and now I'm grappling with getting
> the planner to do something useful with it.  The biggest problem I've run
> into is that a parameterized path can't really be assigned a fixed cost
> in the same way that a normal path can.  The current implementation of
> cost_index() depends on knowing the size of the outer relation --- that
> is, the expected number of execution loops for the indexscan --- in order
> to account for cache effects sanely while estimating the average cost of
> any one inner indexscan.  We know that that is an important thing to do
> because the cost estimates seem to be a lot closer to reality now that we
> do that than what we were getting before; so dropping the consideration is
> entirely out of the question.
>
> The planner is already cheating on this to a considerable extent, because
> it estimates the cost of an inner indexscan only once, using the first
> outer rel we try to join to.  That cost is cached and reused with other
> potential outer-rel join partners, even though very different numbers of
> outer rows might be involved.  This problem will get a lot worse with the
> types of plans that I hope the planner will be able to come up with after
> this fix goes in, because the indexscan might be at the bottom of a join
> nest.  So we need a real fix not another hack.
>
> The best idea I can come up with at the moment is to compute "best case"
> and "worst case" costs for a parameterized path, corresponding to the
> largest and smallest numbers of loops we might expect it to be repeated
> for.  The largest number of loops could be estimated via the cartesian
> product of the sizes of the other tables in the query, for example.  The
> "worst case" cost is its cost if only repeated once.  Then, during path
> pruning in add_path, we only discard a parameterized path if its best-case
> cost is worse than the worst-case cost of some otherwise comparable path.
> Whenever we join the parameterized path with the required outer relation,
> we redo the cost calculation using that rel's actual rowcount estimate in
> order to form a final cost estimate for the no-longer-parameterized join
> path.

Interestingly, I previously proposed almost exactly this approach to
handle a couple of other problems:

http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php
(index only scans)
http://archives.postgresql.org/pgsql-hackers/2009-10/msg00125.php
(per-query work mem)

I'm not entirely sure whether we can use this approach for more than
one kind of problem at a time; if we can't, it's probably not a good
idea to do it at all. I also fear that any venture into this area
will involve slowing down add_path(), which is a hotspot when
planning large join nests. That might not be a win, if this is the
only case it allows us to improve.

*thinks*

It strikes me that the outer tuple count is essentially being used to
derate the cost of the index probes. So another way to look at this
is that the best-case cost of an index probe is just the CPU cost, and
the worst-case cost of an index probe is the CPU cost plus the disk
cost. So your fear that not many parameterized paths will be
discarded is probably valid. But then, maybe that's OK. It seems to
me that the critical point is to make sure that we don't form them in
the first place in cases where we could get the same benefit by
commuting the joins. Once we've gotten to the point of considering a
plan of this type, the chances that it's actually the best plan seem
pretty high.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2010-09-03 12:04:42 Re: plan time of MASSIVE partitioning ...
Previous Message Magnus Hagander 2010-09-03 11:50:09 Re: Streaming a base backup from master