Tom Lane wrote:
> Another thing that's bothering me is that the index access cost computation
> (in genericcostestimate) is looking sillier and sillier:
> * Estimate the number of index pages that will be retrieved.
> * For all currently-supported index types, the first page of the index is
> * a metadata page, and we should figure on fetching that plus a pro-rated
> * fraction of the remaining pages.
> if (index->pages > 1 && index->tuples > 0)
> numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
> numIndexPages += 1; /* count the metapage too */
> numIndexPages = ceil(numIndexPages);
> numIndexPages = 1.0;
> * Compute the index access cost.
> * Disk cost: our generic assumption is that the index pages will be read
> * sequentially, so they have cost 1.0 each, not random_page_cost.
> *indexTotalCost = numIndexPages;
> There are several things wrong with this, at least in its application to
> btrees. It's not counting descent of the btree (ie, touches of the root
> page and intermediate-level pages). On the other hand it's surely
> overcharging for metapage touches. As of CVS tip we cache the metapage in
> the relcache, so it's probably fair to disregard that cost altogether.
> And on the third hand, when we need to retrieve multiple leaf pages it's
> over-optimistic to assume that those'll be purely sequential fetches.
> (That would be approximately true in a freshly-built btree, but not in one
> that's suffered any amount of page splitting or recycling.)
> I am inclined to think that a reasonable model is to continue to estimate
> the number of index leaf pages touched as above (pro-rating against the
> total number of index entries), to charge random_page_cost per leaf page
> touched, and not to count anything for metapage or tree descent. I
> justify the latter on the grounds that upper tree levels will tend to stay
> in cache because they're touched so often. random_page_cost per leaf page
> may be an overestimate, since sometimes logically adjacent leaf pages will
> be physically adjacent too, but not charging for tree descent should help
> to cancel that out. With this model, the disk cost to fetch a single
> index entry will be estimated as random_page_cost (default 4.0) rather
> than the current fixed 2.0. This shouldn't hurt things too much for
> simple indexscans --- especially since anyone running with a reduced
> random_page_cost won't see as much change. And it will increase the costs
> for bitmap scans that cross many index pages, which is what we need in
> light of Philippe's numbers.
This sounds good to me, although the 2.0 -> 4.0 cost jump may cause
(more) cases of people seeing their index scans in pre-8.2 versions
becoming some other type of access in 8.2. I guess a comment about
testing existing applications could be placed in the 8.2 release notes?
> Now we have seen a lot of cases in which indexscans were being drastically
> overestimated, so increasing the cost estimate even more may seem like a
> horrid idea. But I believe that most or all of these cases were ones
> where the same index was being visited many times, and so the real
> estimation failure is not accounting for cache effects across multiple
> indexscans. Rather than being afraid to raise the cost estimate for an
> indexscan in isolation, I think it's time to bite the bullet and do
> something about that.
> The big difficulty in modeling cache effects from multiple scans is that
> we usually don't know how many times the index will be scanned. If we did
> have that number, I think it'd be reasonable to use the Mackert and Lohman
> approximation already used in cost_index to estimate the total number of
> index leaf pages fetched over N scans, and then divide by N to get our
> per-scan estimated cost. But because the planner builds cost estimates
> bottom-up, in most scenarios we don't have any idea whether an indexscan
> plan node will be iterated once or many times in the finished plan.
> However there is one case where we do have some information: when
> computing a best_inner_indexscan() for a nestloop join, it'd be reasonable
> to use the estimated size of the outer relation as the loop count.
> And this is exactly the case where we are falling down in practice:
> the complaints we've seen are mostly about overestimating the cost of a
> nestloop-with-inner-index-scan. So I think handling just this case would
> be a big step forward, even if it's not a theoretically pure general
If I understand correctly, this is the case were we normally see folks
needing add several 'set enable_xxx=off' statements to get the nested
loop plan that *actually* works best :-). So, also looks good!
In response to
pgsql-hackers by date
|Next:||From: Tom Lane||Date: 2006-06-02 02:27:39|
|Subject: Re: [PATCH] Magic block for modules |
|Previous:||From: Christopher Kings-Lynne||Date: 2006-06-02 01:38:54|
|Subject: Re: [PATCH] Magic block for modules|