From: | Benoit Tigeot <benoit(dot)tigeot(at)gmail(dot)com> |
---|---|
To: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | pg_stat_statements: Add `calls_aborted` counter for tracking query cancellations |
Date: | 2025-08-12 14:54:10 |
Message-ID: | CAHUgstAuVpiSr1yRXtCR1mT5U9kvkur6P+kCs1M0dp1c_mDMUQ@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello
This is my first patch that I wanted to submit since nearly 2 years.
It adds a new `calls_aborted` counter to `pg_stat_statements` that
tracks queries terminated due to statement timeouts, user
cancellations, or other errors during execution. Unlike the existing
calls counter which only increments on successful completion,
`calls_aborted` provides visibility into failed query attempts,
enabling better monitoring of query reliability and timeout issues
without needing to examine logs. The counter is incremented in the
`ExecutorRun` `PG_FINALLY` block to capture queries that start
execution but fail to complete.
I work mostly as a web developer in environments with sensitive data.
Leveraging `pg_stat_statements` for cancelled and timed-out queries
instead of logs could be very beneficial for developers who want quick
access to an overview of problematic queries without having access to
all logs and more sensitive data (slow queries, auto explain, etc).
Status: For discussion
Branch: master
Testing: Compiles successfully and passes all regression tests,
including new tests for this feature
Performance impact:
Minimal performance impact. Only adds:
- One additional int64 field to the counters structure
- A hash table lookup and counter increment in the `PG_FINALLY` block
(only for aborted queries)
- No impact on successful query execution paths
Implementation rationale:
The patch tracks aborted queries specifically in `ExecutorRun's
PG_FINALLY` block because:
1. Execution focus: Queries that fail during parsing/planning are
fundamentally different from queries that start executing but get
cancelled
2. Practical monitoring: Operations teams probably care most about
queries that consume resources (start execution) but fail to complete.
It follows my earlier need of more "dev oriented" columns.
3. Consistent semantics: Matches the existing calls counter which
tracks execution completions even if this counter is quite "alone" as
it is not linking to other incremented columns.
The approach only increments `calls_aborted` for existing entries
(doesn't create new ones, using parsed) since queries that would be
tracked are already identified during earlier query phases.
Anticipated questions and responses
(Thanks to Michael Paquier and Tomas Vondra for early discussion on
this patch 🙌)
A. Why not use PGSS_ABORT kind?
pgssStoreKind is designed for successful operations (plans,
executions) or storing "setups" with PGSS_INVALID. For me aborted
calls don't have meaningful timing/resource statistics to store in the
arrays, and it would require extending ALL the timing/buffer/WAL
arrays for incomplete operations.
B. Could we collect additional stats about aborted runs (buffer/WAL
usage, etc.)?
I chose not to do this because it would require modifying execMain.c
and dealing with interrupted queries to gather partial information. It
seems quite ambitious to do as a first patch. I see `calls_aborted`
more as a probe to identify problematic queries for further
investigation in logs or at the application level.
C. Why not track earlier phases?
I deliberately focused `calls_aborted` on executor-level failures
rather than earlier phases (parser/analyzer/rewriter/planner) because
they serve different operational purposes. Earlier phase failures are
typically development-time errors (syntax mistakes, missing tables,
type mismatches) that don't provide actionable operational insights.
Executor aborts represent runtime operational issues (query timeouts,
cancellations, resource exhaustion, lock conflicts, etc.) that
indicate performance degradation or capacity problems requiring
attention. This design keeps the feature focused on what matters for
production monitoring: distinguishing between queries that "worked
before but now fail operationally" versus "never worked due to code
bugs." The implementation is also cleaner, avoiding the complexity of
hooking multiple subsystems and classifying different error types but
of course I may be wrong. ;)
Previous discussions:
- https://www.postgresql.org/message-id/20171112223906.dqtl3kk3pd7vn6yc%40alap3.anarazel.de
- https://www.postgresql.org/message-id/4671.1510537167%40sss.pgh.pa.us
- https://wiki.postgresql.org/wiki/PGConf.dev_2025_Developer_Unconference
"Could we count up timeouts?"
Thanks
__________
Benoit Tigeot
benoit(dot)tigeot(at)gmail(dot)com
Attachment | Content-Type | Size |
---|---|---|
patch_pgss_calls_aborted_v4.patch | application/octet-stream | 7.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2025-08-12 15:06:50 | Re: Adding basic NUMA awareness |
Previous Message | Andres Freund | 2025-08-12 14:31:32 | Re: Annoying warning in SerializeClientConnectionInfo |