Re: slower merge join on sorted data chosen over

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: slower merge join on sorted data chosen over
Date: 2005-10-10 23:34:25
Message-ID: s34ab442.057@gwmta.wicourts.gov
Lists: pgsql-hackers

Hmmm... With that much direction, I really have no excuse
not to try a change and provide some test cases, do I?

A couple of questions occur to me, though. I'm not clear on why
ceil is called -- do we need to eliminate the fraction here?
It seems to me that it wouldn't matter much except when the
index is small, in which case the odds are greater that pages
would be cached. It seems like it would be a less radical change
to eliminate the ceil call than the increment. As a first pass, I'm
inclined to try that in combination with a reduction in the
configured cpu_index_tuple_cost. Does that seem reasonable?

Also, to be clear, I thought someone said that only the leaf level
was counted -- it looks like the other levels are being factored in,
but not as heavily as they would be accessed with logical reads,
because we're taking the fraction of tuples for the index multiplied
by the total number of index pages except for the metadata page.
This should bias the cost slightly higher for additional index levels,
shouldn't it?

I toyed with the idea of eliminating both the ceil and the increment,
and just having a minimum of 1.0, but that seemed extreme.

-Kevin


>>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> 10/10/05 4:23 PM >>>
"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> 10/06/05 9:28 PM >>>
>> There's a known issue that the planner tends to overestimate the cost of
>> inner-index-scan nestloops, because it doesn't allow for the strong
>> caching effects associated with repeated scans of the same index (in
>> particular, that all the upper index levels tend to stay in cache).
>> See the archives for past inconclusive discussions about how to fix
>> this.

> I spent a few hours trying different searches in the archives, and
> found three very interesting threads on the topic. All were from
> 2003. Should I keep digging for more recent threads, or would
> these probably represent the current state of the issue?

No, I don't recall that anything much has happened with it.

The place where the rubber hits the road is this passage in
genericcostestimate():

    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index
     * is a metadata page, and we should figure on fetching that plus a
     * pro-rated fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;     /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;

An important point here: in all the currently supported index types,
that last assumption is ridiculously optimistic. Except maybe in a
freshly-built btree, logically consecutive index pages won't be
physically adjacent, and so an honest cost accounting would require
charging random_page_cost per page. However, that's not the direction
we want the estimate to go in :-(

We could do something based on effective_cache_size to estimate the
probability that the index page is already in cache and hence need not
be re-read. The main trouble with this is estimating what fraction of
the effective_cache_size can fairly be assumed to be populated by each
index --- without some global knowledge about what other indexes are in
heavy use, it's difficult to see how to get there from here. (OTOH,
one could also level the same charge against the existing uses of
effective_cache_size... maybe we could redefine it as cache available
per query, or something like that, to put part of the problem on the
DBA's shoulders.)

One simple tweak we could make is to not count the index metapage in the
pages-to-be-fetched estimate, which could be justified on the grounds
that the metapage is almost certain to be in cache. (Note that the
existing estimation technique is already logically wrong for btrees,
because it's considering only the metapage and the leaf page(s) that
need to be visited, and not the internal pages that need to be descended
through to get to the right leaf. However, again we can say that the
upper internal pages are probably in cache and so it's fair not to count
them as needing to be read.)

Whether this tweak would make the overall behavior better or worse is
really hard to say. It'd be interesting to look at some test cases
though.

regards, tom lane
