Re: Patch: add timing of buffer I/O requests

From: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Patch: add timing of buffer I/O requests
Date: 2011-11-29 01:36:33
Message-ID: CA+CSw_tHJYRyPcfK+eP8sNQVKDCVv-WG1FQTwKYkwdoSW1UTWg@mail.gmail.com
Lists: pgsql-hackers

Sorry for taking so long to respond, I had a pretty busy day at work. Anyway...

On Mon, Nov 28, 2011 at 9:54 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Oh no, it's party pooper time again.  Sorry I have to be the one to do it
> this round.  The real problem with this whole area is that we know there are
> systems floating around where the amount of time taken to grab timestamps
> like this is just terrible.  I've been annoyed enough by that problem to
> spend some time digging into why that is--seems to be a bunch of trivia
> around the multiple ways to collect time info on x86 systems--and after this
> CommitFest is over I was already hoping to dig through my notes and start
> quantifying that more.  So you can't really prove the overhead of this
> approach is acceptable just by showing two examples; we need to find one of
> the really terrible clocks and test there to get a real feel for the
> worst-case.

Sure, I know that the timing calls might be awfully slow. That's why I turned
it off by default. I saw that track_functions was already using this, so I
figured it was ok to have it potentially run very slowly.

> -Document the underlying problem and known workarounds, provide a way to
> test how bad the overhead is, and just throw our hands up and say "sorry,
> you just can't instrument like this" if someone has a slow system.

Some documentation about potential problems would definitely be good.
Same goes for a test tool. ISTM that fast, accurate timing is just not
possible on all supported platforms. That doesn't seem like a good enough
justification to refuse to implement something useful for the majority of
platforms that do support it, as long as it doesn't cause regressions for
those that don't, or add significant code complexity.
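
For example, a throwaway microbenchmark along these lines (just a sketch
of the kind of test tool I mean, not part of the patch, file name made up)
gives a rough ns/call figure for whatever clocksource is active:

/* clockbench.c - rough sketch; build with: gcc -O2 clockbench.c -o clockbench -lrt */
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define NCALLS 1000000

int
main(void)
{
    struct timespec start, end, ts;
    struct timeval tv;
    long i;
    double ns;

    /* per-call cost of clock_gettime(CLOCK_MONOTONIC) */
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < NCALLS; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);
    ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("clock_gettime: %.1f ns/call\n", ns / NCALLS);

    /* per-call cost of gettimeofday() */
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < NCALLS; i++)
        gettimeofday(&tv, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("gettimeofday:  %.1f ns/call\n", ns / NCALLS);

    return 0;
}

Running that with each entry from available_clocksource would show how bad
the worst case is on a given box.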

> -Have one of the PostgreSQL background processes keep track of a time
> estimate on its own, only periodically pausing to sync against the real
> time.  Then most calls to gettimeofday() can use that value instead.  I was
> thinking of that idea for slightly longer running things though; I doubt
> that can be made accurate enough to test instrument buffer

This would limit it to cases where a hundred milliseconds or more of jitter
doesn't matter all that much.

> And while I hate to kick off massive bike-shedding in your direction, I'm
> also afraid this area--collecting stats about how long individual operations
> take--will need a much wider ranging approach than just looking at the
> buffer cache ones.  If you step back and ask "what do people expect here?",
> there's a pretty large number who really want something like Oracle's
> v$session_wait  and v$system_event interface for finding the underlying
> source of slow things.  There's enough demand for that that EnterpriseDB has
> even done some work in this area too; what I've been told about it suggests
> the code isn't a great fit for contribution to community PostgreSQL though.
>  Like I said, this area is really messy and hard to get right.

Yeah, something like that is probably what we should strive for. I'll
ponder a bit more about resource and latency tracking in general. Maybe the
question here should be about the cost/benefit ratio of having some utility
now vs. having to maintain or deprecate the user-visible interface once a
more general framework turns up.

> Something more ambitious like the v$ stuff would also take care of what
> you're doing here; I'm not sure that what you've done helps built it though.
>  Please don't take that personally.  Part of one of my own instrumentation
> patches recently was rejected out of hand for the same reason, just not
> being general enough.

No problem, I understand that half-way solutions can be more trouble than
they're worth. I actually built this to help with performance testing an
application, and thought it would be an interesting experience to try to
give something back to the community.

On Mon, Nov 28, 2011 at 4:40 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> I believe on most systems on modern linux kernels gettimeofday an its ilk
> will be a vsyscall and nearly as fast as a regular function call.

clock_gettime() has been implemented as a vDSO call since 2.6.23.
gettimeofday() has been user-context callable since before git shows any
history (2.6.12).

On Mon, Nov 28, 2011 at 5:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> The other big problem for a patch of this sort is that it would bloat
>> the stats file.
>
> Yes.  Which begs the question of why we need to measure this per-table.
> I would think per-tablespace would be sufficient.

Yeah, I figured that this is something that should be discussed. I
implemented per-table collection because I thought it might be useful for
tools to pick up and show a quick overview of which tables are causing the
most I/O overhead for queries.

On Mon, Nov 28, 2011 at 8:10 PM, Martijn van Oosterhout
<kleptog(at)svana(dot)org> wrote:
> Something good to know: in Linux the file
> /sys/devices/system/clocksource/clocksource0/current_clocksource
> lists the current clock source, and
> /sys/devices/system/clocksource/clocksource0/available_clocksource
> lists the available clock sources. With cat you can switch them. That
> way you may be able to quantify the effects on a single machine.
>
> Learned the hard way while tracking clock-skew on a multicore system.
> The hpet may not be the fastest (that would be the cpu timer), but it's
> the fastest (IME) that gives guarenteed monotonic time.

The Linux kernel seems to go pretty far out of its way to ensure that the
TSC (CPU timestamp counter) based clocksource returns monotonic values,
including actually testing that it does. [1] If the hardware doesn't provide
stable and consistent TSC values, the TSC isn't used as a clock source.

Of course, trying to keep it monotonic doesn't mean succeeding. I thought
about inserting a sanity check, but because the current instrumentation
doesn't use one, and it would catch errors in only one direction, biasing
the long-term average, I decided against it.
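
For reference, the check I decided against would have looked roughly like
this (io_start and io_time are hypothetical local variables; the
INSTR_TIME_* macros are the existing instrumentation helpers):

    instr_time  io_start, io_time;

    INSTR_TIME_SET_CURRENT(io_start);
    /* ... the buffer read being timed ... */
    INSTR_TIME_SET_CURRENT(io_time);
    INSTR_TIME_SUBTRACT(io_time, io_start);
    /* clamp a backwards clock step to zero instead of biasing the totals */
    if (INSTR_TIME_GET_DOUBLE(io_time) < 0)
        INSTR_TIME_SET_ZERO(io_time);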

Because this is non-essential instrumentation, I don't see an issue with
it returning bogus information when the system clock is broken. At least it
seems that no one has complained about the same issue in track_functions;
the only complaint I found is that it's off by default.

On Mon, Nov 28, 2011 at 5:29 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> Another option would be to reimplement the vsyscall, even on platforms
> that don't provide it. The principle is actually quite simple - allocate a
> shared memory, store there a current time and update it whenever a clock
> interrupt happens. This is basically what Greg suggested in one of the
> previous posts, where "regularly" means "on every interrupt". Greg was
> worried about the precision, but this should be just fine I guess. It's
> the precision you get on Linux, anyway ...

On modern platforms you really do get microsecond precision. Better yet, if
you use clock_gettime(CLOCK_MONOTONIC), you get nanosecond precision and
avoid issues with someone changing the system time while you're timing.
This precision does require OS and hardware cooperation, because of per-CPU
offsets, TSCs changing frequency, stopping, and so on.
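
As a minimal sketch with plain libc calls (nothing PostgreSQL-specific
assumed), timing a request with the monotonic clock looks like this:

#include <stdio.h>
#include <time.h>

int
main(void)
{
    struct timespec start, end;
    double us;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... the I/O request being timed would go here ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* monotonic clock: unaffected by someone calling settimeofday() in between */
    us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;
    printf("elapsed: %.3f us\n", us);
    return 0;
}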

--
Ants Aasma

[1] https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc_sync.c#L143
