| From: | Dmitry Fomin <fomin(dot)list(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Cc: | x4mmm(at)yandex-team(dot)ru, Nikolay Samokhvalov <nik(at)postgres(dot)ai>, wolakk(at)gmail(dot)com |
| Subject: | Re: [PATCH v1 0/7] Wait event timing and tracing instrumentation |
| Date: | 2026-07-03 20:26:42 |
| Message-ID: | CAPHG-0nFd3KW6hq1uS=cHuGmrQJ7Y5pzaGjjji9PQPj1g53DrA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi hackers,
As promised in the cover letter, here are the overhead numbers and the
methodology behind them.
Hardware and setup
------------------
Dedicated Hetzner CCX43 (16 dedicated AMD EPYC Milan vCPUs, 64 GB RAM),
Linux, gcc 13, -O2 (meson debugoptimized), no other load. Four builds:
BASELINE unpatched master
WET/off patched, --enable-wait-event-timing, wait_event_capture=off
WET/stats same build, wait_event_capture=stats
WET/trace same build, wait_event_capture=trace (4 MB rings)
pgbench scale 100, shared_buffers = 4 GB unless noted, 16 clients /
8 threads / 30 s, 5 runs per cell, means reported. Run-to-run spread
on this machine is roughly +/- 1-2 %, so single-digit deltas below
should be read against that floor.
The numbers were taken on the pre-split development tip of this work.
The v1 series is a reorganization of that same code into reviewable
pieces; the recording hot paths are unchanged. The off- and stats-mode
results were additionally re-validated on the series itself (same
machine, same protocol) with matching results.
1. pgbench -S (read-only)
-------------------------
mean TPS vs BASELINE
BASELINE 164,859 --
WET/off 166,840 +1.20 %
WET/stats 164,126 -0.44 %
WET/trace 165,687 +0.50 %
All three modes are within run-to-run noise of vanilla. (Yes, "off"
came out faster than baseline here -- see the layout note below for
why we do not read anything into that.)
2. pgbench TPC-B
----------------
mean TPS vs BASELINE
BASELINE 17,569 --
WET/off 17,487 -0.46 %
WET/stats 17,240 -1.87 %
WET/trace 17,314 -1.45 %
3. Wait-saturated worst case (eviction stress)
----------------------------------------------
To probe the per-transition cost directly, shared_buffers = 32 MB
against a ~1.5 GB working set forces ~98 % buffer misses; every miss
produces a BufferIO/DataFileRead wait. This drives the instrumented
path at the highest rate we could construct:
scenario mode TPS vs off
W1 -S, 16 clients off 116,487 --
stats 116,625 +0.1 %
trace 114,844 -1.4 %
W2 TPC-B, 16 clients off 16,813 --
stats 16,766 -0.3 %
trace 16,698 -0.7 %
W3 -S, 32 clients (2x oversub) off 162,845 --
stats 158,880 -2.4 %
trace 159,141 -2.3 %
W4 hot-row UPDATE, 32 clients off 8,138 --
stats 7,772 -4.5 %
trace 7,814 -4.0 %
W4 is the deliberate pathological case: every transaction contends on
the same row, so each transaction goes through many lock-wait
transitions. The ~4.5 % there is the honest ceiling we could produce
for the recording cost; it is bounded and proportional to the
wait-transition rate, which is exactly the workload where you would
want this data. Typical stats-mode cost elsewhere is <= ~0.5 % vs off.
A sample of what trace mode captured under W1, as a sanity check that
the instrumented path was actually exercised:
IO / DataFileRead (dominant), IO / WalWrite, IO / AioIoCompletion,
Activity / BgwriterMain, Activity / WalWriterMain,
LWLock / WaitEventTraceDSA
4. A methodological caution: binary layout
------------------------------------------
While validating the off-mode gate we hit an apparent -3 % TPC-B
regression that survived re-runs within one session -- and then
reversed sign in a fresh session with the identical binary (-3.00 %,
then +0.18 %). A control build with the gate #ifdef'd out entirely
(same patch, no live code in the hot path) measured -2.39 %, while
re-adding the gate measured +2.64 % -- a slowdown from removing an
instruction and a speedup from adding one, which is impossible as an
instruction cost. This is the well-known code-layout effect
(Mytkowicz et al., "Producing Wrong Data Without Doing Anything
Obviously Wrong", ASPLOS 2009): TPC-B on this machine is sensitive to
where unrelated code lands in the binary at the +/- 2-3 % level.
Consequently: (a) we report off-mode as "within noise" rather than
quoting a signed per-mille number, and (b) the earlier unlikely()
annotation on the gate was dropped after objdump showed gcc 13 -O2
emits byte-identical code with and without it (the measured "effect"
of the annotation was pure layout). The gate itself is one global
load and a predicted branch; arithmetic puts it around 10^-5 of a
TPC-B transaction, far below anything measurable here.
5. Stability under the same workloads
-------------------------------------
Not performance, but run on the same box and worth one paragraph: the
full check-world passes on the instrumented and stub builds; an
ASAN-instrumented build ran the pgbench matrix, a connect storm, a
SELECT FOR UPDATE pile-up and a pg_terminate_backend storm with no
findings; stress runs covered immediate-shutdown recovery, rapid mode
flips under load, 150 concurrent cross-backend trace readers against a
live writer, 8 KB and 32 MB rings, an orphan-sweep of 50 short-lived
backends, and a primary + hot-standby pair with trace enabled on both.
No crashes or sanitizer reports in any of it.
Happy to re-run any of this with different parameters, or publish the
driver scripts if that is useful.
Regards,
Dmitry Fomin
On Fri, Jul 3, 2026 at 10:22 PM Dmitry Fomin <fomin(dot)list(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> This series adds opt-in instrumentation that captures the duration of
> every wait event reported through pgstat_report_wait_start()/_end(),
> behind a configure-time flag (--enable-wait-event-timing) and a runtime
> GUC (wait_event_capture = off | stats | trace, default off).
>
> PostgreSQL exposes a rich taxonomy of wait events in pg_stat_activity,
> but only as instantaneous snapshots: there is no in-core way to ask
> "how long do my backends actually spend in each wait?", or "what was
> the wait sequence of this session's last N events?". External tools
> either sample at coarse resolution (pg_wait_sampling, default 10 ms) or
> pay ~200-300 ns per transition via hardware watchpoints.
>
> This is a reworked submission of an earlier 8k-line single patch,
> following Andrey Borodin's advice to split it into independently
> committable pieces with the DSA machinery deferred. The series has
> three groups; each patch builds and passes check-world on its own:
>
> Group 1 -- stats level (patches 1-4, committable on their own):
>
> 0001 adds the configure flag and the wait_event_capture GUC; pure
> scaffolding, no behavior.
> 0002 records per-(backend, event) statistics -- count, total/max
> duration, and a 32-bucket log2 histogram -- in eagerly-allocated
> shared memory, with a single load+branch gate in the inline
> wait_start/wait_end fast path and the recording bodies
> out-of-line. Exposed via pg_stat_wait_event_timing. No DSA.
> 0003 exposes the overflow counters and adds reset functions (own
> backend synchronous; cross-backend via a lock-free
> request/response on an atomic generation counter).
> 0004 enables the flag on one CI task, so both the instrumented and
> the stub build paths are exercised on every run.
>
> Group 2 -- storage refactor:
>
> 0005 converts the per-backend array from eager shared memory to a
> lazily-created DSA, so a build with the feature compiled in but
> never enabled pays no per-backend memory. Pure refactor: the
> SQL surface is unchanged and 0002/0003's tests pass as-is. This
> is the riskiest patch -- it adds the lazy-attach guards
> (critical sections, LWLock wait queues, re-entrancy) and the
> proc_exit teardown gate -- which is exactly why it is isolated.
>
> Group 3 -- trace level:
>
> 0006 adds wait_event_capture = trace: a per-session DSA ring buffer
> of individual waits, readable from the owning session and
> cross-backend (including post-mortem reads of rings whose
> backend has exited). Cross-backend reads are protected by a
> position-encoded identity seqlock; a TAP test drives an
> injection point inside the writer to prove it rejects
> stale-cycle reads that a parity-only seqlock would accept.
> 0007 interleaves query-attribution markers (executor and protocol
> boundaries) into the ring, so a reader can tell which query
> each wait belongs to.
>
> Off-mode overhead: pgbench -S and TPC-B on a dedicated 16-vCPU box
> show the compiled-in-but-off configuration within run-to-run noise of
> an unpatched build (the inline gate is one global load and a predicted
> branch; an earlier unlikely() annotation was dropped after measurement
> showed byte-identical codegen without it). stats mode costs <= ~0.5%
> vs off across read-only, TPC-B, and a wait-saturated
> 32MB-shared_buffers workload. I will post the full numbers and
> methodology in a follow-up message.
>
> Open questions:
>
> * Whether the configure flag should eventually be dropped in favor of
> always-compiled (the off-mode cost is one predicted branch).
> * Defaults worth litigating: 32 histogram buckets,
> wait_event_timing_max_tranches = 192, and
> wait_event_trace_ring_size = 4MB.
> * The trace level's orphan-ring retention tradeoff (post-mortem reads
> vs bounded memory), documented on WaitEventTraceControl.
>
> The patches deliberately contain no catversion bump; one is needed at
> commit time for 0002, 0003 and 0006.
>
> Thanks to Andrey Borodin for the first review of the earlier
> single-patch version -- his advice to break it into independently
> committable pieces shaped this series -- and to Nikolay Samokhvalov and
> Kirk Wolak for discussion and independent testing of the overhead and
> the orphaned-ring lifecycle.
>
> Regards,
> Dmitry Fomin
>
> Dmitry Fomin (7):
> wait_event_timing: add --enable-wait-event-timing flag and
> wait_event_capture GUC
> wait_event_timing: record per-backend wait event statistics (stats level)
> wait_event_timing: expose overflow counters and add reset functions
> ci: build one task with --enable-wait-event-timing
> wait_event_timing: allocate the per-backend array lazily in DSA
> wait_event_timing: add trace level with a per-session ring buffer
> wait_event_timing: add query-attribution markers to the trace ring
>
> .github/workflows/pg-ci.yml | 7 +
> configure | 32 +
> configure.ac | 8 +
> doc/src/sgml/config.sgml | 106 +
> doc/src/sgml/monitoring.sgml | 783 +++++
> meson.build | 1 +
> meson_options.txt | 3 +
> src/backend/catalog/system_views.sql | 103 +
> src/backend/executor/execMain.c | 5 +
> src/backend/postmaster/auxprocess.c | 11 +
> src/backend/storage/lmgr/proc.c | 5 +
> src/backend/tcop/postgres.c | 39 +
> src/backend/utils/.gitignore | 1 +
> src/backend/utils/Makefile | 9 +-
> src/backend/utils/activity/Makefile | 4 +-
> src/backend/utils/activity/backend_status.c | 13 +
> .../activity/generate-wait_event_types.pl | 179 +
> src/backend/utils/activity/meson.build | 1 +
> src/backend/utils/activity/wait_event.c | 3 +-
> .../utils/activity/wait_event_names.txt | 2 +
> .../utils/activity/wait_event_timing.c | 3111 +++++++++++++++++
> src/backend/utils/init/postinit.c | 11 +
> src/backend/utils/misc/guc_parameters.dat | 29 +
> src/backend/utils/misc/guc_tables.c | 1 +
> src/backend/utils/misc/postgresql.conf.sample | 5 +
> src/include/catalog/pg_proc.dat | 59 +
> src/include/pg_config.h.in | 3 +
> src/include/storage/lwlocklist.h | 2 +
> src/include/storage/subsystemlist.h | 2 +
> src/include/utils/.gitignore | 1 +
> src/include/utils/guc.h | 1 +
> src/include/utils/guc_hooks.h | 3 +
> src/include/utils/meson.build | 4 +-
> src/include/utils/wait_classes.h | 9 +
> src/include/utils/wait_event.h | 49 +
> src/include/utils/wait_event_timing.h | 374 ++
> src/test/modules/test_misc/meson.build | 1 +
> .../t/015_wait_event_trace_seqlock.pl | 122 +
> src/test/regress/expected/rules.out | 30 +
> .../regress/expected/wait_event_timing.out | 188 +
> .../regress/expected/wait_event_timing_1.out | 181 +
> src/test/regress/parallel_schedule | 4 +
> src/test/regress/sql/wait_event_timing.sql | 113 +
> src/tools/pgindent/typedefs.list | 12 +
> 44 files changed, 5621 insertions(+), 9 deletions(-)
> create mode 100644 src/backend/utils/activity/wait_event_timing.c
> create mode 100644 src/include/utils/wait_event_timing.h
> create mode 100644 src/test/modules/test_misc/t/015_wait_event_trace_seqlock.pl
> create mode 100644 src/test/regress/expected/wait_event_timing.out
> create mode 100644 src/test/regress/expected/wait_event_timing_1.out
> create mode 100644 src/test/regress/sql/wait_event_timing.sql
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2026-07-03 21:33:13 | Re: [PATCH] Fix for bug #19474: LIKE fails to match literal backslashes with nondeterministic collations |
| Previous Message | Dmitry Fomin | 2026-07-03 20:22:58 | [PATCH v1 0/7] Wait event timing and tracing instrumentation |