| From: | Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru> |
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Jim Nasby <jnasby(at)upgrade(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>, Sami Imseih <samimseih(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Andrei Zubkov <zubkov(at)moonset(dot)ru>, Andrei Lepikhov <lepihov(at)gmail(dot)com> |
| Subject: | Re: Vacuum statistics |
| Date: | 2026-04-28 02:16:29 |
| Message-ID: | 5bc44024-add6-40fd-a8e6-6c4559e09478@yandex.ru |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi, all!
I have updated the core patch that implements the machinery for
collecting extended vacuum statistics (I didn't touch the first patch
that is ready for commit, only patches that are related to extension),
and rebased the ext_vacuum_statistics extension on top of it. The split
is intentional: the core only gathers metrics and hands them out, while
the actual storage and SQL-level access to the statistics live entirely
in the extension. If the extension is not loaded, the overhead is
essentially zero - we only fill a small struct on the stack and do a
NULL check on the hook.
What was updated in the core
The core gains the machinery and the hook through which the extension
receives metrics after each vacuum.
The hook. A new hook has been added in pgstat - set_report_vacuum_hook.
It is fired once per vacuumed table and once per vacuumed index, plus
when forming the per-database aggregate. The extension registers its
handler in _PG_init and by default the hook is NULL, so without an
extension the core behaves exactly as before.
The set of statistics is the same as before. Common to tables, indexes
and the database - hits and misses in shared buffers, number of dirtied
and written pages, WAL volume, buffer read and write times, sleep time
spent in delay points, total wall-clock vacuum time (including I/O and
lock waits), counter of emergency anti-wraparound vacuums, number of
interrupts and removed tuples. Tables additionally report frozen tuples,
pages marked all-frozen / all-visible in the visibility map, number of
scanned and removed pages, number of index passes, etc. Indexes report
freed pages.
The least obvious part of the implementation is subtracting index
statistics from the table statistics. This is the bit worth
highlighting. The thing is that indexes are vacuumed before the heap,
and the buffer and WAL statistics that we capture at the heap level by
the end of the heap vacuum already include everything that was spent on
the indexes. If we simply expose the diff of pgBufferUsage/pgWalUsage
between start and end, the table ends up with double-counted pages/WAL:
once in its own report, and a second time inside the reports of its
indexes. This is especially noticeable with parallel index vacuum:
workers accumulate their usage in the leader only after they finish, so
without subtraction the heap report would receive the combined cost of
all workers as a "bonus".
To handle this, as each index finishes vacuuming, its counters are
accumulated into the state of the current operation, and at the moment
the heap report is built these sums are subtracted out. As a result, the
extension receives clean numbers: "this is what was actually spent on
the table itself", and separately "this is what was actually spent on
each index". The behaviour is idempotent for both serial and parallel
vacuum.
The ext_vacuum_statistics extension
The extension registers the hook handler and stores the received data
through the pgstat custom statistics infrastructure. That is, vacuum
counters are kept not in the extension's own files, but together with
the regular cumulative statistics - they survive a restart and are reset
together with pg_stat_reset_*. Access is provided through three views:
one for tables, one for indexes, and one with the per-database aggregate.
Filtering
This is where the main flexibility lives - the extension does not force
"collect everything", but lets you choose both what to track and which
metrics to keep.
By object type. You can limit collection to databases only (without
per-table detail), to tables only, or collect both. Among tables, you
can additionally filter system / user / all.
By an explicit list. An alternative to "by type" is a whitelist: you
turn the corresponding mode on, and the extension starts collecting
statistics only for the databases and tables that were explicitly
registered via add_track_database / add_track_relation (with matching
remove_* for removal). When the lists are off, the type filter is in
effect; when they are on, only the list applies. This is convenient when
you are interested in monitoring specific "hot" tables and do not want
to spend memory on statistics for everything else.
This list is persisted to disk, and there is one more non-trivial part
here. List changes are concurrent - multiple sessions may call
add_track_* simultaneously, plus there is an object-access hook that
cleans the entry on DROP. To avoid ending up with a torn file, access to
the list is serialized via a dedicated LWLock tranche (requested from a
shmem_request_hook), and the file itself is written atomically: first
into a temporary file, then fflush + pg_fsync + durable_rename. All I/O
return codes are checked; on error the temporary file is removed and the
real one is left untouched; PG_TRY/PG_CATCH guarantees cleanup on
ereport(ERROR). Reading the list takes the same lock in shared mode, so
a concurrent write cannot tear the load.
By metric category. There is also a GUC that takes a list and turns on
the categories of interest - buffers, WAL, general counters, timings (or
all). Unwanted categories are simply skipped on the hook handler side
and never make it into the pgstat entry, which reduces the overhead of
the handler itself. This is useful when, for example, only timings are
needed - in that case the extension does not waste time copying the
buffer and WAL fields.
Privileges. The add_track_* / remove_track_* functions require superuser
or pg_read_all_stats. At the SQL level, EXECUTE is revoked from PUBLIC
and granted only to pg_read_all_stats, so a regular user has no access
to mutating the list. The views are unrestricted, like regular statistics.
What is in the patches
0002-Machinery-for-grabbing-extended-vacuum-statistics.patch - the
machinery in the core plus the hook.
0003-ext_vacuum_statistics-...patch - the extension itself, filtering,
views, tests.
-----------
Best regards,
Alena Rybakina
Yandex Cloud
| Attachment | Content-Type | Size |
|---|---|---|
| v38-0001-Track-table-VM-stability.patch | text/plain | 21.7 KB |
| v38-0002-Machinery-for-grabbing-extended-vacuum-statistics.patch | text/plain | 25.0 KB |
| v38-0003-ext_vacuum_statistics-extension-for-extended-vacuum-.patch | text/plain | 145.2 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | shveta malik | 2026-04-28 03:59:39 | Re: Fix race condition in pg_get_publication_tables with concurrent DROP TABLE |
| Previous Message | Peter Smith | 2026-04-28 01:57:00 | Re: Redundant/mis-use of _(x) gettext macro? |