BUG #14901: Canceled queries missing from pg_stat_statements

From: gregburek(at)heroku(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Cc: gregburek(at)heroku(dot)com
Subject: BUG #14901: Canceled queries missing from pg_stat_statements
Date: 2017-11-12 22:28:11
Message-ID: 20171112222811.1464.28388@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 14901
Logged by: Greg Burek
Email address: gregburek(at)heroku(dot)com
PostgreSQL version: 9.5.10
Operating system: linux
Description:

Hello,
I recently worked with an app that hit a mysterious wall in Postgres query
performance, where overall query latency had increased dramatically, with
sporadic statement_timeout producing app errors. Examining
pg_stat_statements showed some queries that were consuming resources, which
were then optimized, but the overall query latency and app error rate stayed
constant.

Only after examining logs did we find a class of query that was being
consistently canceled by statement_timeout=30s. This query was found mostly
by chance as it was a fraction of the total statement_timeout rate, as some
sessions have much more aggressive statement_timeout=2s. Running explain
analyze on this 30s timeout query, we found it to be extremely resource
heavy and ultimately causing the overall degredation of response time.

This was a surprise to the app's developers that a query, which had a large
execution time but was being canceled by a relatively sane
statement_timeout, was missing from pg_stat_statements. That one would need
to access, consume and analyze postgres logs was unexpected, given Postgres
documents
(https://www.postgresql.org/docs/current/static/pgstatstatements.html) that
"The pg_stat_statements module provides a means for tracking execution
statistics of *all* SQL statements executed by a server." That canceled
queries are missing from this view seems like an easy to miss, but vital
exception.

I think this is either a bug in how pgss does not account for canceled
queries that still exert load on a database or in the documentation of pgss
that it only accounts for completed queries and that statement_timeout or
app timeouts can result in query undercounting or omissions.

Below is a reproduction of this issue, where a pathologically bad query,
which is responsible for the bulk of the load executed on the db, does not
increment or add execution statistics to pg_stat_statements.

I would expect that the final `select * from test order by random() limit
?;` entry of pg_stat_statements would have a count equal 7 and total_time
around 142623.038.

=# create extension "uuid-ossp";
CREATE EXTENSION
Time: 10.053 ms

=# create extension "pg_stat_statements";
CREATE EXTENSION
Time: 23.607 ms

=# create table test as (
SELECT id as id, uuid_generate_v4() as uuid, uuid_generate_v1() as
name
FROM generate_series(1,(10^8)::int) as id);
SELECT 100000000
Time: 276467.422 ms (04:36.467)

=# select pg_stat_statements_reset();
pg_stat_statements_reset
--------------------------

(1 row)

Time: 0.867 ms

=# select query, calls from pg_stat_statements;
query | calls
------------------------------------+-------
select pg_stat_statements_reset(); | 1
(1 row)

Time: 1.221 ms

=# select * from test order by random() limit 1;
id | uuid | name
----------+--------------------------------------+--------------------------------------
79600490 | 46844417-f3ee-4add-b5c5-1ba3e6db5da5 |
e4c118b2-c5a3-11e7-be4b-8c8590305665
(1 row)

Time: 22627.277 ms (00:22.627)

=# select query, calls from pg_stat_statements;
query | calls
-----------------------------------------------+-------
select query, calls from pg_stat_statements; | 1
select pg_stat_statements_reset(); | 1
select * from test order by random() limit ?; | 1
(3 rows)

Time: 1.439 ms

=# SET statement_timeout='20s';
SET
Time: 8.936 ms

=# select * from test order by random() limit 1;
ERROR: 57014: canceling statement due to statement timeout
LOCATION: ProcessInterrupts, postgres.c:2971
Time: 20073.747 ms (00:20.074)

=# select * from test order by random() limit 1;
ERROR: 57014: canceling statement due to statement timeout
LOCATION: ProcessInterrupts, postgres.c:2971
Time: 20000.455 ms (00:20.000)

=# select * from test order by random() limit 1;
ERROR: 57014: canceling statement due to statement timeout
LOCATION: ProcessInterrupts, postgres.c:2971
Time: 20065.561 ms (00:20.066)

=# select * from test order by random() limit 1;
ERROR: 57014: canceling statement due to statement timeout
LOCATION: ProcessInterrupts, postgres.c:2971
Time: 20019.753 ms (00:20.020)

=# select * from test order by random() limit 1;
ERROR: 57014: canceling statement due to statement timeout
LOCATION: ProcessInterrupts, postgres.c:2971
Time: 20056.721 ms (00:20.057)

=# select * from test order by random() limit 1;
ERROR: 57014: canceling statement due to statement timeout
LOCATION: ProcessInterrupts, postgres.c:2971
Time: 20001.994 ms (00:20.002)

=# select query, calls, total_time from pg_stat_statements;
query | calls | total_time
-----------------------------------------------+-------+------------
select query, calls from pg_stat_statements; | 2 | 2.063
SET statement_timeout='20s'; | 1 | 0.857
select pg_stat_statements_reset(); | 1 | 2.753
select * from test order by random() limit ?; | 1 | 22623.038
(4 rows)

Time: 2.536 ms

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andres Freund 2017-11-12 22:39:06 Re: BUG #14901: Canceled queries missing from pg_stat_statements
Previous Message Tom Lane 2017-11-12 17:43:02 Re: BUG #14897: Segfault on statitics SQL request