Re: reducing random_page_cost from 4 to 2 to force index scan

From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-performance(at)postgresql(dot)org
Subject: Re: reducing random_page_cost from 4 to 2 to force index scan
Date: 2011-05-19 18:39:58
Message-ID: 091662B1-58B5-489D-8A92-0251A1E01E7C@nasby.net
Lists: pgsql-performance

On May 19, 2011, at 9:53 AM, Robert Haas wrote:
> On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
>> Jim Nasby wrote:
>>> I think the challenge there would be how to define the scope of the
>>> hot-spot. Is it the last X pages? Last X serial values? Something like
>>> correlation?
>>>
>>> Hmm... it would be interesting if we had average relation access times for
>>> each stats bucket on a per-column basis; that would give the planner a
>>> better idea of how much IO overhead there would be for a given WHERE clause
>>
>> You've already given one reasonable first answer to your question here. If
>> you defined a usage counter for each histogram bucket, and incremented that
>> each time something from it was touched, that could lead to a very rough way
>> to determine access distribution. Compute a ratio of the counts in those
>> buckets, then have an estimate of the total cached percentage; multiplying
>> the two will give you an idea how much of that specific bucket might be in
>> memory. It's not perfect, and you need to incorporate some sort of aging
>> method to it (probably weighted average based), but the basic idea could
>> work.
>
> Maybe I'm missing something here, but it seems like that would be
> nightmarishly slow. Every time you read a tuple, you'd have to look
> at every column of the tuple and determine which histogram bucket it
> was in (or, presumably, which MCV it is, since those aren't included
> in working out the histogram buckets). That seems like it would slow
> down a sequential scan by at least 10x.

You definitely couldn't do it in real time. But you might be able to copy the tuple somewhere and have a background process do the analysis.
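
To make the idea concrete, here is a minimal sketch of the per-bucket usage counter Greg describes, fed from a queue that a background worker drains rather than updated on every tuple read. Everything in it is an assumption for illustration: the class and method names, the queue of sampled column values, and the simple exponential decay used for aging are all hypothetical, and none of this is existing PostgreSQL code.

```python
from bisect import bisect_right
from collections import deque


class BucketUsage:
    """Hypothetical per-column usage counters, one per histogram bucket."""

    def __init__(self, bounds, decay=0.5):
        # bounds: sorted histogram boundaries, so there are len(bounds) - 1 buckets
        self.bounds = list(bounds)
        self.decay = decay
        self.counts = [0.0] * (len(self.bounds) - 1)

    def bucket_of(self, value):
        # Locate the bucket holding value; out-of-range values clamp to the ends
        i = bisect_right(self.bounds, value) - 1
        return min(max(i, 0), len(self.counts) - 1)

    def drain(self, queue):
        # Background step: consume sampled values copied aside by the scan
        while queue:
            self.counts[self.bucket_of(queue.popleft())] += 1.0

    def age(self):
        # Crude aging: decay every counter so old accesses fade over time
        self.counts = [c * self.decay for c in self.counts]

    def access_share(self):
        # Ratio of each bucket's count to the total of all counts
        total = sum(self.counts)
        return [c / total if total else 0.0 for c in self.counts]

    def cached_estimate(self, cached_fraction):
        # Rough estimate: access share multiplied by the overall cached fraction
        return [s * cached_fraction for s in self.access_share()]


usage = BucketUsage([0, 100, 200, 300])
sampled = deque([250, 260, 270, 50])  # values copied aside during scans
usage.drain(sampled)
print(usage.cached_estimate(0.4))
```

The scan only pays for appending to the queue; the bucket lookup happens later, which is the point of moving the analysis out of the read path. Whether copying tuples aside is cheap enough in practice is a separate question this sketch does not answer.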

That said, it might be more productive to know which blocks are in memory and use correlation to guesstimate whether a particular query will need hot or cold blocks. Or perhaps we create a different structure that tracks the distribution of each column linearly through the table, something more sophisticated than correlation alone: for example, recording which stats bucket is most prevalent in each block (or range of blocks) in a table. That information would let you estimate which blocks in the table you're likely to need...
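
As a rough illustration of that last structure, the sketch below records the most prevalent stats bucket for each range of blocks and then uses that map to guess which block ranges a WHERE clause on a value interval would touch. The function names, the list-of-lists stand-in for table blocks, and the fixed range size are all hypothetical; this is a toy model of the idea, not something the planner does.

```python
from bisect import bisect_right
from collections import Counter


def bucket_of(bounds, value):
    # Index of the histogram bucket holding value, clamped to valid buckets
    i = bisect_right(bounds, value) - 1
    return min(max(i, 0), len(bounds) - 2)


def build_range_map(block_values, bounds, blocks_per_range):
    """block_values[n] is the list of column values stored in block n.

    Returns one entry per block range: the most prevalent bucket there,
    or None if the range holds no values.
    """
    range_map = []
    for start in range(0, len(block_values), blocks_per_range):
        tally = Counter()
        for values in block_values[start:start + blocks_per_range]:
            tally.update(bucket_of(bounds, v) for v in values)
        range_map.append(tally.most_common(1)[0][0] if tally else None)
    return range_map


def likely_block_ranges(range_map, bounds, lo, hi):
    # Block ranges whose dominant bucket overlaps the interval [lo, hi]
    wanted = range(bucket_of(bounds, lo), bucket_of(bounds, hi) + 1)
    return [i for i, b in enumerate(range_map) if b in wanted]


bounds = [0, 100, 200, 300]
blocks = [[1, 5, 150], [20, 30], [110, 120], [130, 250], [260, 270], [280, 10]]
range_map = build_range_map(blocks, bounds, blocks_per_range=2)
print(range_map)
print(likely_block_ranges(range_map, bounds, 210, 290))
```

Keeping only the dominant bucket per range loses information when a range mixes several buckets evenly, so this would be an estimate at best. Combining the resulting block ranges with knowledge of which blocks are currently in memory would then give the hot-versus-cold guess.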
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net
