Parsing log files for commonly needed performance data is no fun. As a
next step toward eliminating that, I've been working on how to approach
this similarly to how pg_stat_statements can cut down on query log
processing. I thought it would be novel to outline this for design
review before coding any more of it. I keep switching between calling
the sort of things people want to know about Timing Events and
Performance Events. No preference here, I'll use the former because
Modeling this on pg_stat_statements includes the hope of packaging it as
an extension with minor core hooks, and the idea that there would be a
fixed size list of timing events available at any time. Consider it a
FIFO queue built using a circular buffer. Dump events into there and
provide a way to view them. If some fall out before they're
analyzed/saved, that doesn't seem a problem for now. If you want 100%
durable, the log is still available. Eventually I'd like a consumer for
these that wrote them to a history table as an option, but that seems a
second priority after getting them into memory in the first place. I'd
like that consumer thing for pg_stat_statements too, but so far we're
getting by without it. It seems like something that might benefit from
the in-core queuing work one day too.
= Initial event list =
There are three particular events I have been thinking about, hoping
that will be enough to make sure the structure is general enough to
handle more later:
-log_lock_waits output. This is my #1 goal. Lock waits are incredibly
common issues in larger systems, and the process for tracking them down
right now is very painful. Probably deadlocks too.
-log_temp_files output. This everyone wants to monitor, but it won't be
as useful unless query tieing (below) works.
-log_checkpoints. The fight to make this data available via
pg_stat_bgwriter has been useful. You really need a trending system to
do anything useful with the data that way though. Being able to write a
simple timing events query that looked for checkpoint spikes would be
great for alerting, which is often done with something other than a
trending package. To give specific examples here, the graphs Munin
makes for pg_stat_bgwriter data work, but they're not much of a help if
what you really want is to trigger a Nagios alert when checkpoints are
slow. A Timing Event entry of a slow checkpoint would be easy to
trigger alerts from.
= Data to save =
There's a few obvious fields that could be explicitly saved as regular
-event type code. Presumably an integer that's decoded via a master
list of these, in some header file.
-running process id
-running process details (below)
-referenced process id. The one causing the problem in situations like
a lock wait.
-referenced process details (below)
-serial number. People who will want to poll & store this data might
want a better way to do that than just a timestamp. Would let you know
when you lost some of it, too, so you'd know to consider increasing the
buffer size or decrease the polling interval.
= Storing event details =
When I think about how people are going to want to use this data, it
should be easy to tear apart without returning to the need to write a
My first idea with that in mind was to consider this like a series of
key/value pairs. If you look at the log_checkpoints output, you can see
it approaches that form already at the end, with "=" signs as the
delimiters. Taking an example from a pre-9.1 server (so no sync times,
but they obviously fit the style already):
checkpoint complete: wrote 128770 buffers (12.3%); 0 transaction log
file(s) added, 979 removed, 129 recycled; write=431.784 s, sync=6048.384
s, total=6484.669 s
You can imagine that instead as:
wrote=128770 wrote_pct=12.3 log_added=0 log_removed=979 log_recycled=129
recycled write=431.784 sync=6048.384 s total=6484.669
Even if the timing event data was stored exactly like that--simple text
string, perhaps with some escaping of "=" signs--I think it would be
completely usable to people. log_temp_files output saves like this
easily enough too.
Another way this data might be stored is to have a key/value composite
type, and then make each timing event entry an array of those.
I'm open to other ideas for how to store this. If the events go into
one place, the different data each saves turns into the classic
"flexible schema" problem.
= Query text =
For the often multi-line query text that's dumped by log_lock_waits,
I've thought of two possibilities so far:
-Mash them into a single text line using "\n" between them. Now you're
even back to where, with the right escaping, the entire timing event
detail part could still collapse into a single text string. That would
kill the idea of simple parsing of key/value data stuffed there though.
-Create an array of lines to hold the two statements. Can always make
the view that shows the data unnest it. Seems like this would compel
putting the two query description arrays into their own fields though.
I don't really like either of these. The second should result in a
better UI in the end though, at the expense of bloating the width of the
event table some.
= Query Tieing =
I'd like to have a way to reference the normalized version of the active
query or queries at the time when that's appropriate too. That seems
like it should be a key that's joinable with the normalized version in
pg_stat_statements. One of the ideas bounced around in the "Hash id in
pg_stat_statements" thread, like assigning a numeric code for them,
would seem fine for that. This will turn dependent on one of those
ideas being committed.
I have not confirmed how feasible it is to extract a query hash key at
the point when the log messages show up yet. ProcSleep logs the lock
waiting and it grabs the full query text. Presumably that means it
could grab the hash key instead.
FileClose is what logs temp file data. That seems pretty disconnected
from what the executor is doing at the time.
If it's not possible to tie things to specific queries cleanly in every
case, I'd rather dump that requirement for now than make it a blocker.
That can always be added onto the basic timing event plumbing later.
= Limitations of scope =
I was thinking the act of loading the extension for this would turn the
logging on, and the existing GUCs like log_lock_waits will also apply.
It seems useful and easy to have a load time option for the module that
says to disable all the replaced log file writes if you're using the
timing event mechanism. It would be nice to have a granular way to ask
for some timing events: "I want to see lock waits but not temp files".
That seems like scope creep I don't think is necessary to make this
All these initial targets are events that do not occur that often, and
I'm not too concerned about the performance of this implementation.
Lock waits come out once per second per connection. temp files
shouldn't be that common. And checkpoints are infrequent. For now, I
hope that only targeting infrequent events means the overhead of dumping
the timing event into memory will be little overhead. Eventually I'd
like to have a much finer timing events capability, more along the lines
of the DTrace level. Getting overhead low enough to go there seems hard
to attack at first though, while simultaneously dealing well with what
I'm trying to extract without log file parsing.
Alternately, think about the overhead of these initial targets like
this: right now, all of these things are being written to the log file.
If you can afford to do that, you can certainly afford to write a copy
to memory instead, and probably in addition too.
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
pgsql-hackers by date
|Next:||From: Josh Berkus||Date: 2012-11-01 23:54:34|
|Subject: Re: RFC: Timing Events|
|Previous:||From: Daniel Farina||Date: 2012-11-01 19:42:18|
|Subject: Re: Synchronous commit not... synchronous?|