2011/3/24 Jim Nasby <jim(at)nasby(dot)net>:
> On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
>> 2011/3/22 Greg Stark <gsstark(at)mit(dot)edu>:
>>> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>>> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>>> I've done this -- actually better, I used mincore to actually check
>>> whether the block was in cache before issuing the read -- but it turns
>>> out you can't get what you're looking for this way.
>> The linux fincore() syscall never get in the kernel, maybe something
>> to revive...
> Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?
I made a quick test with time measurement, and find quickly the main
bottleneck with this strategy. How to know if block has been fetched
from OS memory, SAN memory, quick RAID, slow SATA ......
I just added a gettimeofday around the read() call, and adjust the
XXms|µs used to seperate disk fetch and memory fetch.
By manualy adjusting this duration I get good results but wonder how
this can be automatically adjusted on other systems, also the method
use for measuring may impact the measure.
Maybe using it to just track 'slow' access, and define 'slow access' in a GUC...
>>> It turns out when you do this you see one block being read from disk
>>> followed by n blocks that all appear to be cache hits. Because they've
>>> been prefetched by the kernel.
>> I did the same, I now believe that it is not very important to have
>> the very exact numbers.
>> Prefetech blocks *are* in memory when we request them, the first read
>> access read more than one block because the cost is the same.
> Yeah... there's places in the planner where we make guesses as to the likelyhood of something being in-cache. If we could actually track complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.
> And having this info in pg_stats would be extremely valuable.
>>> What you end up with is actually something like the number of iops
>>> which is also an interesting measure but not really what you were
>>> looking for.
>>> My getrusage patch, which I should still dig out though it's rather
>>> too late to be committing now unless someone tells me otherwise, would
>>> tell you how much i/o a plan node actually did. But you won't know
>>> which blocks did the i/o since I was only tracking totals for the plan
>>> node. That's probably what you're looking for here.
>> Please show us the patch :)
> Jim C. Nasby, Database Architect jim(at)nasby(dot)net
> 512.569.9461 (cell) http://jim.nasby.net
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
In response to
pgsql-hackers by date
|Next:||From: Andres Freund||Date: 2011-04-09 12:14:43|
|Subject: Re: really lazy vacuums?|
|Previous:||From: Bruce Momjian||Date: 2011-04-09 11:03:29|
|Subject: Re: pg_upgrade bug found!|