On Wed, Feb 22, 2012 at 4:43 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Attached are updated versions of this feature without the pg_test_timing
> tool part, since I broke that out into another discussion thread. I've
> split the part that updates pg_stat_statistics out from the main feature
> too, separate patch attached to here (but I'm not reviewing that yet). Lots
> of bitrot since this was submitted, and yes I noticed that I've almost
> recreated earlier versions of this patch--by splitting off the parts that
> were developed later.
Thanks for the review and splitting. Sorry I didn't fix up the bit rot myself.
> Earlier discussion of this got side tracked on a few things, partly my
> fault. It's worth taking a look at what this provides before judging it too
> much. It can demo well.
> The stated purpose is helping figure out what relations are gobbling up the
> most access time, presumably to optimize them and/or the storage they are
> on. "What do I put onto SSD" is surely a popular request nowadays.
I should have stated the purpose more clearly. The original reason for
developing this patch was to figure out "what queries are taking the
most time and why", specifically in the case where OS memory is a lot
larger than shared_buffers. Basically, I wanted the following query to
give a quick overview of where the bottlenecks are:
SELECT query, total_time,
       (time_read + time_write) / total_time AS io_fraction
FROM pg_stat_statements
ORDER BY total_time DESC LIMIT 20;

This of course hugely benefits from Peter's pg_stat_statements work.
Tracking timings per relation was actually an afterthought.
> Now, the first critical question to ask is "what additional information is
> this providing above the existing counters?" After all, it's possible to
> tell pgbench_accounts is the hotspot just from comparing heap_blks_read,
Like I said above, I find it mostly useful for seeing what is missing
the OS cache. With memory being as cheap as it is, a reasonably priced
server can have 128GB of memory, while the maximum recommended value
for shared_buffers is 8GB. Such a server is quite likely to have tables
that fit into the OS cache but not into shared_buffers, and it's not
trivial to figure out which those are.
> This run looks useful at providing the data wished for--that read times are
> slower per capita from the accounts table. The first time I tried this I
> got a bizarre high number for pgbench_branches.heap_blks_time ; I'm not sure
> how reliable this is yet. One problem that might be easy to fix is that the
> write timing info doesn't show in any of these system views, only in EXPLAIN
> and statement level ones.
I'm not sure about the source of the huge number; it might be
instability in the clock source. Have you tried running the
monotonicity check for a longer period while the system is under load?
Another issue with the current timing code is that gettimeofday()
isn't guaranteed to be monotonic anyway; things like NTP adjustments
can make time go backwards. clock_gettime() with CLOCK_MONOTONIC_RAW
would be better, but that's Linux-specific :(
The reason I didn't add write timings to relation stats is that I
couldn't figure out what the semantics should be. It could be either
"time spent waiting for this relation's blocks to be written out" or
"time spent waiting for some other relation's blocks to be written out
to free space for this relation's block", or maybe the cost should be
distributed; background writes could be included or excluded. Writes
usually return quickly, unless lots of possibly unrelated writes have
dirtied enough of the OS cache, etc. I figured that whatever choices I
made, they wouldn't really help anyone diagnose anything. Having
global write timings in pg_stat_bgwriter might be useful, but I feel
that is something for another patch.
> I still think a full wait timing interface is the right long-term direction
> here. It's hard to reject this idea when it seems to be working right now
> though, while more comprehensive wait storage is still at least a release
> off. Opinions welcome, I'm still juggling this around now that I have it
> working again.
I agree that a wait timing interface is the right direction. I have
thought a bit about it and could share some ideas - maybe I should
create a wiki page where the general design could be hashed out?
Anyway, the user-visible information from this patch should be trivial
to extract from a general wait timing framework. Pushing my own agenda
a bit: having this patch in the current release would help to get some
field experience on any issues surrounding timing :)
> Some implementation notes. This currently fails regression test
> create_function_3, haven't looked into why yet.
I'll take a look at it.