Re: auto_explain WAS: RFC: Timing Events

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: auto_explain WAS: RFC: Timing Events
Date: 2012-11-21 23:33:10
Message-ID: 50AD64B6.4030607@archidevsys.co.nz
Lists: pgsql-hackers

On 22/11/12 12:15, Greg Smith wrote:
> On 11/8/12 2:16 PM, Josh Berkus wrote:
>
>> Also, logging only the long-running queries is less useful than people
>> on this list seem to think. When I'm doing real performance analysis, I
>> need to see *everything* which was run, not just the slow stuff. Often
>> the real problem is a query which used to take 1.1ms, now takes 1.8ms,
>> and gets run 400 times/second. Looking just at the slow queries won't
>> tell you that.
>
> No argument here. I've tried to be clear that some of these
> high-speed but lossy things I'm hacking on are not going to be useful
> for everyone. A solution that found most of these 1.8ms queries, but
> dropped some percentage under the highest peak load conditions, would
> still be very useful to me.
>
> An anecdote on this topic seems relevant. I have a troublesome
> production server that has moved log_min_duration_statement from 100ms
> to 500ms to 2000ms as the system grew. Even the original setting
> wasn't short enough to catch everything we would like to watch *now*,
> but seeing sub-second data is a dream at this point. The increases
> have been forced by logging contention becoming unmanageable when every
> client has to fight for the log to write to disk. I can see the
> buggers stack up waiting for locks if I try to log shorter
> statements, stalling enough that it drags the whole server down under
> peak load.
>
> If I could turn off logging just during those periods--basically,
> throwing entries away only when some output-rate-throttled component hit
> its limit--I could still find them in the data collected the rest of
> the time. There are some types of problems that also only occur under
> peak load that this idea would miss, but you'd still be likely to get
> *some* of them, statistically.
>
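
For what it's worth, the shape I imagine for that "output-rate-throttled
component" is a token bucket in front of the log writer: lines pass through
at whatever rate the disk can sustain, short bursts are absorbed, and
anything beyond that is counted and thrown away rather than queued behind a
saturated log. A rough, hypothetical sketch (nothing like this exists in
PostgreSQL today; the rates are invented):

/* Token-bucket gate for log writes: admit lines at a sustainable rate,
 * absorb short bursts, and drop the rest instead of blocking the caller.
 * Hypothetical standalone code for illustration only. */
#include <stdio.h>
#include <time.h>

#define MAX_LINES_PER_SEC 1000.0    /* assumed sustainable write rate */
#define BUCKET_CAPACITY   2000.0    /* how large a burst we tolerate */

static double tokens = BUCKET_CAPACITY;
static double last_refill = 0.0;

static double
now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Returns 1 if this log line may be written, 0 if it should be dropped. */
static int
log_rate_ok(void)
{
    double now = now_seconds();

    if (last_refill == 0.0)
        last_refill = now;
    tokens += (now - last_refill) * MAX_LINES_PER_SEC;
    if (tokens > BUCKET_CAPACITY)
        tokens = BUCKET_CAPACITY;
    last_refill = now;

    if (tokens < 1.0)
        return 0;               /* over budget: drop, never block */
    tokens -= 1.0;
    return 1;
}

int
main(void)
{
    long    written = 0;
    long    dropped = 0;
    long    i;

    for (i = 0; i < 100000; i++)
    {
        if (log_rate_ok())
            written++;
        else
            dropped++;
    }
    printf("written %ld, dropped %ld\n", written, dropped);
    return 0;
}

Under peak load the dropped counter at least tells you how much was
discarded, which is itself worth recording once things calm down.
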
> There's a really hard trade-off here:
>
> -Sometimes you must save data on every query to capture fine details
> -Making every query wait for disk I/O is impractical
>
> The sort of ideas you threw out for making things like auto-explain
> logging per-session can work. The main limitation there though is
> that it presumes you even know what area the problem is in the first
> place. I am not optimistic that covers a whole lot of ground either.
>
> Upthread I talked about a background process that persists shared
> memory queues as a future consumer of timing events--one that might
> consume slow query data too. That is effectively acting as the ideal
> component I described above, one that only loses things when it
> exceeds the system's write capacity for saving them. I wouldn't want
> to try to retrofit the existing PostgreSQL logging facility for such
> a thing, though. Log parsing as the way to collect data is filled with
> headaches anyway.
>
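
That queue-plus-background-writer split is easy to picture: backends append
to a bounded in-memory queue and move on, a separate process drains it to
disk at its own pace, and entries are lost only when the queue overflows
because the writer has fallen behind. A hypothetical sketch of just the
data structure, single-threaded and with none of the real shared-memory or
locking plumbing it would actually need:

/* Bounded queue with a lazy writer: producers never block, the drain side
 * persists whatever has accumulated, and overflow is counted as loss.
 * Hypothetical standalone code for illustration only. */
#include <stdio.h>

#define QUEUE_SLOTS 4096
#define ENTRY_LEN   256

typedef struct
{
    char        entries[QUEUE_SLOTS][ENTRY_LEN];
    unsigned long head;     /* total entries successfully enqueued */
    unsigned long tail;     /* total entries drained to storage */
    unsigned long dropped;  /* lost because the writer fell behind */
} EventQueue;

/* Backend side: never blocks; drops when the queue is full. */
static void
queue_put(EventQueue *q, const char *msg)
{
    if (q->head - q->tail >= QUEUE_SLOTS)
    {
        q->dropped++;
        return;
    }
    snprintf(q->entries[q->head % QUEUE_SLOTS], ENTRY_LEN, "%s", msg);
    q->head++;
}

/* Writer side: persist whatever has accumulated, at its own pace. */
static void
queue_drain(EventQueue *q, FILE *out)
{
    while (q->tail < q->head)
    {
        fprintf(out, "%s\n", q->entries[q->tail % QUEUE_SLOTS]);
        q->tail++;
    }
}

int
main(void)
{
    static EventQueue q;
    char        msg[ENTRY_LEN];
    int         i;

    for (i = 0; i < 10000; i++)
    {
        snprintf(msg, sizeof(msg), "timing event %d", i);
        queue_put(&q, msg);
        if (i % 5000 == 4999)   /* pretend the writer only wakes up rarely */
            queue_drain(&q, stdout);
    }
    queue_drain(&q, stdout);
    fprintf(stderr, "dropped %lu of %lu events\n",
            q.dropped, q.head + q.dropped);
    return 0;
}
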
> I don't see any other good way to resolve this trade-off. To help
> with the real-world situation you describe, ideally you dump all the
> query data somewhere, fast, and have the client move on. You can't
> make queries wait for storage; something else (with a buffer!) needs
> to take over that job.
>
> I can't find the URL right now, but at PG.EU someone was showing me a
> module that grabbed the new 9.2 logging hook and shipped the result to
> another server. Clients burn a little CPU and network time and they
> move on, and magically disk I/O goes out of their concerns. How much
> overhead persisting the data takes isn't the database server's job at
> all then. That's the sort of flexibility people need to have with
> logging eventually. Something so efficient that every client can
> afford to do it: capable of saving all events under ideal conditions,
> but under adverse ones it keeps going and accepts the loss.
>
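
I haven't seen that module, but I would guess the shape is roughly this:
install emit_log_hook (new in 9.2) and push each message out a non-blocking
UDP socket, so the backend never waits on log I/O and a lost datagram is
just a lost line. A hypothetical sketch only; the hook itself is real, but
the collector address, port, and framing are invented:

/* Sketch of an emit_log_hook consumer that ships log messages to another
 * machine over UDP, fire-and-forget.  emit_log_hook is the real 9.2 hook;
 * everything about the destination is made up for illustration. */
#include "postgres.h"
#include "fmgr.h"
#include "utils/elog.h"

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

PG_MODULE_MAGIC;

void        _PG_init(void);

static emit_log_hook_type prev_emit_log_hook = NULL;
static int  log_sock = -1;
static struct sockaddr_in log_dest;

static void
ship_log_line(ErrorData *edata)
{
    /* Fire-and-forget: if the datagram is dropped, so be it. */
    if (log_sock >= 0 && edata->message != NULL)
        (void) sendto(log_sock, edata->message, strlen(edata->message),
                      MSG_DONTWAIT,
                      (struct sockaddr *) &log_dest, sizeof(log_dest));

    if (prev_emit_log_hook)
        prev_emit_log_hook(edata);
}

void
_PG_init(void)
{
    log_sock = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&log_dest, 0, sizeof(log_dest));
    log_dest.sin_family = AF_INET;
    log_dest.sin_port = htons(5140);        /* invented collector port */
    inet_pton(AF_INET, "127.0.0.1", &log_dest.sin_addr);

    prev_emit_log_hook = emit_log_hook;
    emit_log_hook = ship_log_line;
}

Loaded through shared_preload_libraries it would see every backend's
output, and the database server never has to care how fast the collector
persists anything.
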
Would it be useful to be able to specify a fixed sampling rate, say 1 in
100? That way you could get a well-defined statistical sample without
causing excessive I/O.
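
Something along these lines, in rough form (a hypothetical standalone gate,
not a patch), using random rather than every-Nth selection so the sample
does not lock onto periodic patterns in the workload:

/* One-in-N sampling gate: keep roughly one statement in every
 * SAMPLE_ONE_IN, so the statistics stay representative while the logging
 * I/O drops to about 1/N of the full firehose.  Illustration only. */
#include <stdio.h>
#include <stdlib.h>

#define SAMPLE_ONE_IN 100

/* Returns 1 for roughly one call in SAMPLE_ONE_IN. */
static int
sample_this_statement(void)
{
    return rand() % SAMPLE_ONE_IN == 0;
}

int
main(void)
{
    long    logged = 0;
    long    skipped = 0;
    long    i;

    srand(12345);
    for (i = 0; i < 1000000; i++)
    {
        if (sample_this_statement())
            logged++;           /* this statement would be logged/explained */
        else
            skipped++;
    }
    printf("logged %ld of %ld statements\n", logged, logged + skipped);
    return 0;
}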

Cheers,
Gavin
