| From: | Dmitry Fomin <fomin(dot)list(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Cc: | x4mmm(at)yandex-team(dot)ru, Nikolay Samokhvalov <nik(at)postgres(dot)ai>, wolakk(at)gmail(dot)com |
| Subject: | [PATCH v1 0/7] Wait event timing and tracing instrumentation |
| Date: | 2026-07-03 20:22:58 |
| Message-ID: | CAPHG-0mAOn05ae6Kqx1wHXxzOk4E5W7ajjd=QBhgkR7a0uyQmw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi hackers,
This series adds opt-in instrumentation that captures the duration of
every wait event reported through pgstat_report_wait_start()/_end(),
behind a configure-time flag (--enable-wait-event-timing) and a runtime
GUC (wait_event_capture = off | stats | trace, default off).
PostgreSQL exposes a rich taxonomy of wait events in pg_stat_activity,
but only as instantaneous snapshots: there is no in-core way to ask
"how long do my backends actually spend in each wait?", or "what was
the wait sequence of this session's last N events?". External tools
either sample at coarse resolution (pg_wait_sampling, default 10 ms) or
pay ~200-300 ns per transition via hardware watchpoints.
This is a reworked submission of an earlier 8k-line single patch,
following Andrey Borodin's advice to split it into independently
committable pieces with the DSA machinery deferred. The series has
three groups; each patch builds and passes check-world on its own:
Group 1 -- stats level (patches 1-4, committable on their own):
0001 adds the configure flag and the wait_event_capture GUC; pure
scaffolding, no behavior.
0002 records per-(backend, event) statistics -- count, total/max
duration, and a 32-bucket log2 histogram -- in eagerly-allocated
shared memory, with a single load+branch gate in the inline
wait_start/wait_end fast path and the recording bodies
out-of-line. Exposed via pg_stat_wait_event_timing. No DSA.
0003 exposes the overflow counters and adds reset functions (own
backend synchronous; cross-backend via a lock-free
request/response on an atomic generation counter).
0004 enables the flag on one CI task, so both the instrumented and
the stub build paths are exercised on every run.
Group 2 -- storage refactor:
0005 converts the per-backend array from eager shared memory to a
lazily-created DSA, so a build with the feature compiled in but
never enabled pays no per-backend memory. Pure refactor: the
SQL surface is unchanged and 0002/0003's tests pass as-is. This
is the riskiest patch -- it adds the lazy-attach guards
(critical sections, LWLock wait queues, re-entrancy) and the
proc_exit teardown gate -- which is exactly why it is isolated.
Group 3 -- trace level:
0006 adds wait_event_capture = trace: a per-session DSA ring buffer
of individual waits, readable from the owning session and
cross-backend (including post-mortem reads of rings whose
backend has exited). Cross-backend reads are protected by a
position-encoded identity seqlock; a TAP test drives an
injection point inside the writer to prove it rejects
stale-cycle reads that a parity-only seqlock would accept.
0007 interleaves query-attribution markers (executor and protocol
boundaries) into the ring, so a reader can tell which query
each wait belongs to.
Off-mode overhead: pgbench -S and TPC-B on a dedicated 16-vCPU box
show the compiled-in-but-off configuration within run-to-run noise of
an unpatched build (the inline gate is one global load and a predicted
branch; an earlier unlikely() annotation was dropped after measurement
showed byte-identical codegen without it). stats mode costs <= ~0.5%
vs off across read-only, TPC-B, and a wait-saturated
32MB-shared_buffers workload. I will post the full numbers and
methodology in a follow-up message.
Open questions:
* Whether the configure flag should eventually be dropped in favor of
always-compiled (the off-mode cost is one predicted branch).
* Defaults worth litigating: 32 histogram buckets,
wait_event_timing_max_tranches = 192, and
wait_event_trace_ring_size = 4MB.
* The trace level's orphan-ring retention tradeoff (post-mortem reads
vs bounded memory), documented on WaitEventTraceControl.
The patches deliberately contain no catversion bump; one is needed at
commit time for 0002, 0003 and 0006.
Thanks to Andrey Borodin for the first review of the earlier
single-patch version -- his advice to break it into independently
committable pieces shaped this series -- and to Nikolay Samokhvalov and
Kirk Wolak for discussion and independent testing of the overhead and
the orphaned-ring lifecycle.
Regards,
Dmitry Fomin
Dmitry Fomin (7):
wait_event_timing: add --enable-wait-event-timing flag and
wait_event_capture GUC
wait_event_timing: record per-backend wait event statistics (stats level)
wait_event_timing: expose overflow counters and add reset functions
ci: build one task with --enable-wait-event-timing
wait_event_timing: allocate the per-backend array lazily in DSA
wait_event_timing: add trace level with a per-session ring buffer
wait_event_timing: add query-attribution markers to the trace ring
.github/workflows/pg-ci.yml | 7 +
configure | 32 +
configure.ac | 8 +
doc/src/sgml/config.sgml | 106 +
doc/src/sgml/monitoring.sgml | 783 +++++
meson.build | 1 +
meson_options.txt | 3 +
src/backend/catalog/system_views.sql | 103 +
src/backend/executor/execMain.c | 5 +
src/backend/postmaster/auxprocess.c | 11 +
src/backend/storage/lmgr/proc.c | 5 +
src/backend/tcop/postgres.c | 39 +
src/backend/utils/.gitignore | 1 +
src/backend/utils/Makefile | 9 +-
src/backend/utils/activity/Makefile | 4 +-
src/backend/utils/activity/backend_status.c | 13 +
.../activity/generate-wait_event_types.pl | 179 +
src/backend/utils/activity/meson.build | 1 +
src/backend/utils/activity/wait_event.c | 3 +-
.../utils/activity/wait_event_names.txt | 2 +
.../utils/activity/wait_event_timing.c | 3111 +++++++++++++++++
src/backend/utils/init/postinit.c | 11 +
src/backend/utils/misc/guc_parameters.dat | 29 +
src/backend/utils/misc/guc_tables.c | 1 +
src/backend/utils/misc/postgresql.conf.sample | 5 +
src/include/catalog/pg_proc.dat | 59 +
src/include/pg_config.h.in | 3 +
src/include/storage/lwlocklist.h | 2 +
src/include/storage/subsystemlist.h | 2 +
src/include/utils/.gitignore | 1 +
src/include/utils/guc.h | 1 +
src/include/utils/guc_hooks.h | 3 +
src/include/utils/meson.build | 4 +-
src/include/utils/wait_classes.h | 9 +
src/include/utils/wait_event.h | 49 +
src/include/utils/wait_event_timing.h | 374 ++
src/test/modules/test_misc/meson.build | 1 +
.../t/015_wait_event_trace_seqlock.pl | 122 +
src/test/regress/expected/rules.out | 30 +
.../regress/expected/wait_event_timing.out | 188 +
.../regress/expected/wait_event_timing_1.out | 181 +
src/test/regress/parallel_schedule | 4 +
src/test/regress/sql/wait_event_timing.sql | 113 +
src/tools/pgindent/typedefs.list | 12 +
44 files changed, 5621 insertions(+), 9 deletions(-)
create mode 100644 src/backend/utils/activity/wait_event_timing.c
create mode 100644 src/include/utils/wait_event_timing.h
create mode 100644 src/test/modules/test_misc/t/015_wait_event_trace_seqlock.pl
create mode 100644 src/test/regress/expected/wait_event_timing.out
create mode 100644 src/test/regress/expected/wait_event_timing_1.out
create mode 100644 src/test/regress/sql/wait_event_timing.sql
| Attachment | Content-Type | Size |
|---|---|---|
| v1-0003-wait_event_timing-expose-overflow-counters-and-ad.patch | application/octet-stream | 33.3 KB |
| v1-0001-wait_event_timing-add-enable-wait-event-timing-fl.patch | application/octet-stream | 15.8 KB |
| v1-0002-wait_event_timing-record-per-backend-wait-event-s.patch | application/octet-stream | 87.1 KB |
| v1-0004-ci-build-one-task-with-enable-wait-event-timing.patch | application/octet-stream | 1.8 KB |
| v1-0005-wait_event_timing-allocate-the-per-backend-array-.patch | application/octet-stream | 24.8 KB |
| v1-0007-wait_event_timing-add-query-attribution-markers-t.patch | application/octet-stream | 20.4 KB |
| v1-0006-wait_event_timing-add-trace-level-with-a-per-sess.patch | application/octet-stream | 132.1 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Dmitry Fomin | 2026-07-03 20:26:42 | Re: [PATCH v1 0/7] Wait event timing and tracing instrumentation |
| Previous Message | Zsolt Parragi | 2026-07-03 20:21:37 | Re: SLOPE - Planner optimizations on monotonic expressions. |