
RFC: Timing Events

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: RFC: Timing Events
Date: 2012-11-01 23:08:39
Message-ID: 509300F7.5000803@2ndQuadrant.com
Lists: pgsql-hackers
Parsing log files for commonly needed performance data is no fun.  As a 
next step toward eliminating that, I've been working on how to approach 
this similarly to how pg_stat_statements cuts down on query log 
processing.  I thought it would be novel to outline this for design 
review before coding any more of it.  I keep switching between calling 
the sort of thing people want to know about "Timing Events" and 
"Performance Events".  No preference here; I'll use the former because 
it's shorter.

Modeling this on pg_stat_statements includes the hope of packaging it as 
an extension with minor core hooks, and the idea that there would be a 
fixed size list of timing events available at any time.  Consider it a 
FIFO queue built using a circular buffer.  Dump events into there and 
provide a way to view them.  If some fall out before they're 
analyzed/saved, that doesn't seem a problem for now.  If you want 100% 
durable, the log is still available.  Eventually I'd like a consumer for 
these that wrote them to a history table as an option, but that seems a 
second priority after getting them into memory in the first place.  I'd 
like that consumer thing for pg_stat_statements too, but so far we're 
getting by without it.  It seems like something that might benefit from 
the in-core queuing work one day too.
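To make the buffer behavior concrete, here is a minimal Python model of 
that FIFO-on-a-circular-buffer idea.  Everything here (class name, 
methods, fields) is invented for illustration; the real store would be a 
fixed-size circular buffer in shared memory, written in C:

```python
from collections import deque

class TimingEventBuffer:
    """Fixed-size FIFO of timing events.  A hypothetical model only:
    the names here are invented, and the real store would live in
    shared memory."""

    def __init__(self, size):
        self._buf = deque(maxlen=size)  # deque drops the oldest on overflow
        self._serial = 0                # ever-increasing event counter

    def push(self, event):
        """Dump an event in; if the buffer is full, the oldest falls out."""
        self._serial += 1
        self._buf.append((self._serial, event))

    def view(self):
        """Snapshot of what is still buffered, as a view would show it."""
        return list(self._buf)

    def lost(self):
        """How many events fell out before being read.  Serial numbers
        let a poller detect loss and react by enlarging the buffer or
        polling more often."""
        return self._serial - len(self._buf)

buf = TimingEventBuffer(size=3)
for i in range(5):
    buf.push({"type": "lock_wait", "n": i})
# Only the last three events remain; serials 1 and 2 were lost.
```

The `lost()` accounting is why the serial number column described below 
earns its place: a consumer that sees a gap in serials knows exactly how 
much it missed.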

= Initial event list =

There are three particular events I have been thinking about, hoping 
that will be enough to make sure the structure is general enough to 
handle more later:

-log_lock_waits output.  This is my #1 goal.  Lock waits are an 
incredibly common issue in larger systems, and the process for tracking 
them down right now is very painful.  Probably deadlocks too.

-log_temp_files output.  Everyone wants to monitor this, but it won't be 
as useful unless query tying (below) works.

-log_checkpoints.  The fight to make this data available via 
pg_stat_bgwriter has been useful.  You really need a trending system to 
do anything useful with the data that way though.  Being able to write a 
simple timing events query that looked for checkpoint spikes would be 
great for alerting, which is often done with something other than a 
trending package.  To give specific examples here, the graphs Munin 
makes for pg_stat_bgwriter data work, but they're not much of a help if 
what you really want is to trigger a Nagios alert when checkpoints are 
slow.  A Timing Event entry of a slow checkpoint would be easy to 
trigger alerts from.

= Data to save =

There's a few obvious fields that could be explicitly saved as regular 
columns:

-timestamp
-user
-database
-event type code.  Presumably an integer that's decoded via a master 
list of these, in some header file.
-running process id
-running process details (below)
-referenced process id.  The one causing the problem in situations like 
a lock wait.
-referenced process details (below)
-serial number.  People who want to poll & store this data may want a 
better way to do that than just a timestamp.  It would also let you know 
when you lost some events, so you'd know to consider increasing the 
buffer size or decreasing the polling interval.
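Collected into a record, those columns might look like the following 
Python sketch.  The field names and the lock-wait detail string are 
hypothetical, not a committed schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class TimingEvent:
    """One entry in the proposed fixed-size event store.  Illustrative
    field names only."""
    serial: int                  # monotonic; gaps tell pollers they lost events
    ts: datetime                 # timestamp
    username: str
    database: str
    event_type: int              # decoded via a master list in a header file
    pid: int                     # running process id
    details: str                 # running process details, as key/value text
    ref_pid: Optional[int] = None      # process causing e.g. a lock wait
    ref_details: Optional[str] = None  # referenced process details

ev = TimingEvent(serial=1, ts=datetime(2012, 11, 1, 23, 8, 39),
                 username="postgres", database="pgbench",
                 event_type=1, pid=4242,
                 details="relation=16384 wait_ms=1500.2")
```

For an event like a checkpoint there is no referenced process, so the two 
reference fields stay empty; that asymmetry is part of the "flexible 
schema" problem discussed below.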

= Storing event details =

When I think about how people are going to want to use this data, it 
should be easy to tear apart without returning to the need to write a 
complicated parser.

My first idea with that in mind was to consider this like a series of 
key/value pairs.  If you look at the log_checkpoints output, you can see 
it approaches that form already at the end, with "=" signs as the 
delimiters.  Taking an example from a pre-9.1 server (so no sync times, 
but they obviously fit the style already):

checkpoint complete: wrote 128770 buffers (12.3%); 0 transaction log 
file(s) added, 979 removed, 129 recycled; write=431.784 s, sync=6048.384 
s, total=6484.669 s

You can imagine that instead as:

wrote=128770 wrote_pct=12.3 log_added=0 log_removed=979 log_recycled=129 
write=431.784 sync=6048.384 total=6484.669

Even if the timing event data was stored exactly like that--simple text 
string, perhaps with some escaping of "=" signs--I think it would be 
completely usable to people.  log_temp_files output saves like this 
easily enough too.
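Escaping makes the single-string idea workable.  Here is a rough Python 
sketch of one such scheme; the exact escaping rules are my invention for 
illustration, not part of the proposal:

```python
import re

def kv_serialize(pairs):
    """Flatten event details into one 'k=v k=v ...' text line, escaping
    backslashes, '=' signs, and spaces in values.  Illustrative scheme
    only."""
    def esc(v):
        return (str(v).replace("\\", "\\\\")
                      .replace("=", "\\=")
                      .replace(" ", "\\ "))
    return " ".join(f"{k}={esc(v)}" for k, v in pairs)

def kv_parse(text):
    """Inverse: split on unescaped spaces, then each token on its first
    unescaped '=', and undo the escaping."""
    out = {}
    for token in re.split(r"(?<!\\) ", text):
        k, v = re.split(r"(?<!\\)=", token, maxsplit=1)
        out[k] = (v.replace("\\ ", " ")
                   .replace("\\=", "=")
                   .replace("\\\\", "\\"))
    return out

detail = kv_serialize([("wrote", 128770), ("wrote_pct", 12.3),
                       ("write", 431.784), ("sync", 6048.384),
                       ("total", 6484.669)])
# detail == "wrote=128770 wrote_pct=12.3 write=431.784 sync=6048.384 total=6484.669"
```

Even stored as plain text like this, a consumer can tear the details 
apart with one function call instead of a hand-written log parser.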

Another way this data might be stored is to have a key/value composite 
type, and then make each timing event entry an array of those.

I'm open to other ideas for how to store this.  If the events go into 
one place, the different data each saves turns into the classic 
"flexible schema" problem.

= Query text =

For the often multi-line query text that's dumped by log_lock_waits, 
I've thought of two possibilities so far:

-Mash them into a single text line using "\n" between them.  With the 
right escaping, the entire timing event detail part could then still 
collapse into a single text string.  That would kill the idea of simple 
parsing of the key/value data stuffed there, though.

-Create an array of lines to hold the two statements.  Can always make 
the view that shows the data unnest it.  Seems like this would compel 
putting the two query description arrays into their own fields though.

I don't really like either of these.  The second should result in a 
better UI in the end though, at the expense of bloating the width of the 
event table some.
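The "right escaping" in the first option can be anything that 
round-trips.  A Python sketch using JSON purely as a stand-in escaping 
scheme, with two invented statements from a lock wait:

```python
import json

def pack_queries(queries):
    """First option: collapse the multi-line statements into a single
    line of text.  JSON escaping keeps embedded newlines and quotes
    unambiguous; it stands in here for whatever escaping the real
    format would use."""
    return json.dumps(queries)

def unpack_queries(packed):
    return json.loads(packed)

# Hypothetical waiting and blocking statements:
statements = [
    "UPDATE accounts\nSET balance = balance - 100\nWHERE id = 1;",
    "UPDATE accounts\nSET balance = balance + 100\nWHERE id = 1;",
]
packed = pack_queries(statements)
```

The second option corresponds to keeping `statements` as its own array 
column and letting the view unnest it.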

= Query Tying =

I'd like to have a way to reference the normalized version of the active 
query or queries at the time when that's appropriate too.  That seems 
like it should be a key that's joinable with the normalized version in 
pg_stat_statements.  One of the ideas bounced around in the "Hash id in 
pg_stat_statements" thread, like assigning a numeric code for them, 
would seem fine for that.  This part depends on one of those ideas being 
committed.

I have not yet confirmed how feasible it is to extract a query hash key 
at the point where the log messages show up.  ProcSleep logs the lock 
wait and grabs the full query text; presumably it could grab the hash 
key instead.

FileClose is what logs temp file data.  That seems pretty disconnected 
from what the executor is doing at the time.

If it's not possible to tie things to specific queries cleanly in every 
case, I'd rather dump that requirement for now than make it a blocker. 
That can always be added onto the basic timing event plumbing later.

= Limitations of scope =

I was thinking the act of loading the extension for this would turn the 
logging on, and the existing GUCs like log_lock_waits will also apply. 
It seems useful and easy to have a load time option for the module that 
says to disable all the replaced log file writes if you're using the 
timing event mechanism.  It would be nice to have a granular way to ask 
for some timing events:  "I want to see lock waits but not temp files". 
That seems like scope creep that isn't necessary to make this useful, 
though.

All these initial targets are events that do not occur very often, so 
I'm not too concerned about the performance of this implementation. 
Lock waits come out at most once per second per connection, temp files 
shouldn't be that common, and checkpoints are infrequent.  For now, I 
hope that only targeting infrequent events means the cost of dumping a 
timing event into memory will be negligible.  Eventually I'd like to 
have a much finer timing events capability, more along the lines of the 
DTrace level.  Getting overhead low enough to go there seems hard to 
attack at first though, while simultaneously dealing well with what I'm 
trying to extract without log file parsing.

Alternately, think about the overhead of these initial targets like 
this:  right now, all of these things are being written to the log file. 
If you can afford to do that, you can certainly afford to write a copy 
to memory instead, and probably in addition as well.

-- 
Greg Smith   2ndQuadrant US    greg(at)2ndQuadrant(dot)com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

