RFC: built-in historical query time profiling

From: "Ed L(dot)" <pgsql(at)bluepolka(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: RFC: built-in historical query time profiling
Date: 2005-03-23 20:32:05
Message-ID: 200503231332.05912.pgsql@bluepolka.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hackers,

I'd like to pose a problem we are facing (historical query time
profiling) and see if any of you interested backend gurus have
an opinion on the promise or design of a built-in backend
solution (optional built-in historical query time stats), and/or
willingness to consider such a patch submission.

Our Problem: We work with 75+ geographically distributed pg
clusters; it is a significant challenge keeping tabs on
performance. We see degradations from rogue applications,
vacuums, dumps, bloating indices, I/O and memory shortages, and
so on. Customers don't generally tell us when applications are
slow, so we need to know for ourselves in a timely manner. At
present, we can remotely and systematically query system
relations for diskspace usage, detailed I/O usage,
index/sequential scans, and more. But our _ultimate_ DB
performance measure is query execution time. Obviously, you can
measure that now in an ad hoc fashion with EXPLAIN ANALYZE, and
by examining historical logs. But we need to be able to see the
history in a timely fashion to systematically identify
customer-experienced execution time degradations for "query
patterns of interest" without any visual log inspection
whatsoever, and correlate those with other events. We can do
this by writing programs to periodically parse log files for
queries and durations, and then centralizing that information
into a db for analysis, similar to pqa's effort. Short of a
backend solution, that's what we'll do.

Backend solution?

But being able to query the database itself for historical
execution time statistics for "query patterns of interest" is
very attractive. Such functionality would seem generally very
useful for other deployments. Below is a rough novice sketch of
an **optional** scheme for doing so in the backend (I'm sure
it's incomplete/faulty in this presentation; I'm really trying
to determine if there are any fatal short-comings over the
log-parsing approach).

Suppose there were some sort of system relations like these:

pg_query_profile (
id integer,
name varchar not null unique,
sql_regex varchar not null unique,
enabled boolean not null
)

pg_query_profile_history (
profile_id integer not null, -- refs pg_query_profile.id
count integer, -- number of matches in period
avgdur float not null, -- avg duration in secs
mindur float not null, -- min duration
maxdur float not null, -- max duration
errors bigint not null, -- errors in period
period_start timestamp not null,
period_end timestamp not null
)

Each row in pg_query_profile_history would represent execution
time stats for queries matching a given regex for a given
interval. The sql_regex column would be a user-specified value
matching "queries of interest". For example, if I were
interested in profiling all queries of the form

"SELECT * FROM result WHERE key = 123"

, then maybe my sql_regex would basically be

INSERT INTO pg_query_profile (name, sql_regex)
VALUES ('Result Queries',
'SELECT * FROM result WHERE key = \d+');

Then, as each query completed, that query was (optionally!)
checked against existing pg_query_profile.sql_regex values for a
patten match, and any matching pg_query_profile rows for that
period were then updated with the duration data. I can imagine
wishing to collect this data for 10-20 most-common queries in
5-minute intervals for the past 24 hours or so.

One could then systematically identify degradations beyond 1.0
seconds with a query similar to the following:

SELECT COUNT(1)
FROM pg_query_profile_view
WHERE name = 'Result Queries'
AND avgdur > 1.0;

Once the data is there, it opens up a lot of possibilities for
systematic monitoring.

Some possible objections (O) and answers (A):

1) O: But wouldn't this impose too much overhead in the backend
for transactions for folks who don't want/need this feature? A:
Not if it were completely optional, right?

2) O: If enabled, there is no way you'd want to impose an
update query on each select query! A: True. I envision the
query profile as cached in shared memory and only written to
disk a user-configurable "every so often".

3) O: Regular expression evaluation is computationally
expensive! A: I'm imagining it might add a few milliseconds to
each query, which would be well worth the benefit to us in
having the most important metric easily accessible.

GUC variables might include:

query_profile : boolean on/off for profiling

query_profile_interval : how often to write out stats
Example: Make each profile row represent 5 minutes
query_profile_interval = 300

query_profile_window : how long to keep stats
Example: Keep data for past 24 hours
query_profile_window = 86400

query_profile_cache_size : Max size of profiling cache
Hard limit on how much we'll cache

Thanks for your consideration.

Ed

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2005-03-23 22:34:46 Re: RFC: built-in historical query time profiling
Previous Message Oleg Bartunov 2005-03-23 20:19:21 Re: more detailed timing ?