Quick Links

Re: Improve pg_stat_statements scalability

From:	Sami Imseih <samimseih(at)gmail(dot)com>
To:	Lukas Fittl <lukas(at)fittl(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Improve pg_stat_statements scalability
Date:	2026-05-30 02:15:31
Message-ID:	CAA5RZ0uoxiQ2_=xHGRnyc4WdM9aR0fzdMhBubnw97po==--yGQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

>> I've created a new wiki page combining the prior 2025 discussion, and
>> notes from today:
>>
>> https://wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements

>Thank you, Lukas!

Thanks for the feedback on v2, and for the productive unconference
discussion in PGConf.dev. I've been iterating on the patches/benchmarks
and wanted to post an updated series that addresses the major design
points raised.

Addressing the feedback from the unconference session:

1/ Eviction

The concern in the design presented in which entries are skipped until
a single threaded eviction completed was that on a steady workload with
max=5000, the 5001st unique query would always be immediately evicted.
This was considered unacceptable, and rightly so. It may be fine for
high churn, pathological cases, but for slow churn cases it would be
unacceptable.

Andres mentioned that eviction should occur in parallel, which I
understood as multiple backends should evict a subset of the entries in
the dshash concurrently. To do this, I implemented a parallel
clock-sweep in which eviction sweeps a partition decrementing
refcounts, and only entries that have decayed to zero are evicted. A
genuinely new query that just arrived has a fresh refcount and survives
the sweep, and if it becomes popular in the meantime, it will survive
for longer. So the algorithm wants an entry to prove itself to remain
considered "hot", else it's swept.

There's a 5% headroom using the already existing
USAGE_DEALLOC_PERCENT in which we evict until count (across all
partitions) drops to 95% of max, so there's always room for new
arrivals before the next sweep.

I chose not to use a background worker for eviction, as discussed
earlier, based on the consensus that backpressure is important, and a
background worker being asynchronous in nature will not provide that.
Also, it could be complicated where a background worker cannot be spun
up for whatever reason.

However, to implement parallel eviction as described above, I needed 2
core changes:

- pgstat_drop_entry must be able to optionally tolerate a dropped
entry, as there could be a delay between the time an entry is marked
as dropped and garbage collection, and within that time multiple
evictions may attempt to drop the same entry.

- Implement a seq scan API for dshash that scans a specific partition.
This will allow eviction to cycle through partitions, i.e.
clock-sweep.

The eviction does not attempt to make room in the bucket for an entry
that triggered it. The main point is to keep forward progress in
making room. We don't need to be more strict here. Also the fact that
pg_stat_statements.max is dynamic means a user can increase this value
to manage high churn without a restart.

The dealloc counter in pg_stat_statements_info now counts individual
entries evicted rather than the number of times eviction was invoked.
I think this is more useful, but it does change the semantics between
versions. With the new design, a single eviction pass can remove many
entries across a partition, so counting evictions no longer tells
you much. Open to other thoughts here on this.

Profiling revealed that pgstat_request_entry_refs_gc(), which is the
standard pattern for less frequently removed entries in other places,
was too expensive when called on every entry drop or every partition
sweep. Now it is only called once per full rotation across all
partitions, which showed much better results in benchmarks.

2/ DSA only query text storage

Rather than a "performance cliff" that Lukas mentioned above when we
switch between DSA and disk for query text storage, the consensus is
to just store all query text in memory. Andres made a point that even
now, transient memory usage for loading the query text for the purpose
of garbage-collection or reading the query stats in
pg_stat_statements means that a user's machine must have enough memory
to handle this. So, why not just throw all query text in memory. The
memory is capped by a new GUC pg_stat_statements.query_text_memory
(default of 4MB but up for discussion).

It is also possible to store more entries due to .max than
.query_text_memory can support, so empty query text columns could be
possible. In the case a user increases .query_text_memory if they
observe empty query columns, the next time an entry is touched, it
will backfill the query text and normalize the string if it can. The
last part could be improved and actually guaranteed if we make
JumbleState available to all hooks (I did not work on this part, but
open for discussion).

Some other comments from Lukas's earlier review:

> It appears you've moved the equivalent of the "if (!entry)" check
> into the pgss_store_query_text function, and we now unconditionally
> call generate_normalized_query.

Fixed.

v3 series is now 5 patches:

0001: pgstat: Introduce pg_stat_report_anytime()

Nothing changes from v2.

0002: pgstat: tolerate already-dropped entries in pgstat_drop_entry()

Required for the parallel eviction design. With multiple backends
sweeping different partitions concurrently, the same entry can be
targeted for drop more than once before garbage collection runs.
A skip_dropped flag makes this safe rather than throwing ERROR.

0003: dshash: add partition-scoped sequential scan

Adds dshash_seq_init_partition() to restrict a scan to a single
partition. This is the building block for per-partition clock-sweep;
a backend only locks and sweeps one partition at a time.

0004: pg_stat_statements: modernize entry storage with pgstat kind

The main patch. Replaces ShmemInitHash with dshash via DSM registry,
and replaces per-entry spinlock counter updates with a custom pgstat
kind that uses the core pgstat infrastructure.

Eviction changes from qsort-all-entries to clock-sweep with an atomic
rotating hand. Each entry carries a refcount (capped at 10) that
decays on sweep; entries reaching zero are evicted. Hot queries keep
their refcount topped up proportionally to access frequency.

pg_stat_statements.max becomes PGC_SIGHUP.

0005: pg_stat_statements: store query text in DSA instead of file

Moves query text from pgss_query_texts.stat into a DSA area via
GetNamedDSA. Adds pg_stat_statements.query_text_memory (PGC_SIGHUP,
default 4MB) controlling DSA size. Eliminates the GC machinery
entirely. When DSA is exhausted, entries are still tracked but query
text is stored as NULL. A backfill mechanism recovers text on
subsequent executions once space becomes available.

Benchmark:

Attached are the benchmark scripts I used for v3 (and will keep
using going forward) with the results in the benchmark_v3.txt file comparing
patch vs upstream.

The benchmark performs various workloads: "high churn", "light churn",
"multi stmt", and simple "select1". I am also tracking query retention
(hot/cold entry retention) to verify the clock-sweep behaves as
expected. I attached the .sql scripts used to benchmark.

The select1 result shows a 1.0% regression. This workload has no
contention, so it purely measures pgstat infrastructure overhead;
perf profiling shows pgstat_get_entry_ref() at the top. However, on
machines with higher core count the upstream spinlock on the counters
becomes a bottleneck, which is where the dshash design should win
back this overhead and maybe more. I still plan to benchmark this on
a larger machine.

You will notice that cold_calls in the patched churn case are much
lower (805 vs 4,458). This is because entries get evicted sooner
under per-partition sweep. Hot and cold query retention lines up w
ith current upstream (1000/1000 hot entries survive continuous churn).

The deallocs count is much higher in the patch (11.9M vs 38.5K in
high churn). per-partition sweep fires frequently to keep the table
at target capacity, whereas upstream batches fewer, larger deallocations.
We can maybe look into reducing USAGE_DEALLOC_PERCENT to
increase retention of "colder" entries.

We also see much less LWLock contention in the patched churn case.
The top wait is PgStatsDSA (502 total) vs pg_stat_statements (7,757)
in upstream, a 15x reduction.

Looking forward to your feedback!

--
Sami Imseih
Amazon Web Services (AWS)

Attachment	Content-Type	Size
benchmark_v3.txt	text/plain	3.9 KB
bench_select1.sql	application/sql	10 bytes
bench_churn.sql	application/sql	202 bytes
bench_light_churn.sql	application/sql	200 bytes
bench_multi_stmt.sql	application/sql	176 bytes
v3-0001-pgstat-Introduce-pg_stat_report_anytime-for-mid-t.patch	application/x-patch	28.5 KB
v3-0002-pgstat-tolerate-already-dropped-entries-in-pgstat.patch	application/x-patch	7.4 KB
v3-0003-dshash-add-partition-scoped-sequential-scan.patch	application/x-patch	5.3 KB
v3-0005-pg_stat_statements-store-query-text-in-DSA-instea.patch	application/x-patch	37.0 KB
v3-0004-pg_stat_statements-modernize-entry-storage-with-p.patch	application/x-patch	117.1 KB

In response to

Re: Improve pg_stat_statements scalability at 2026-05-24 01:29:26 from Sami Imseih

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Scott Ray	2026-05-30 02:31:27	Re: Report oldest xmin source when autovacuum cannot remove tuples
Previous Message	Chao Li	2026-05-30 02:00:24	Re: Fix race during concurrent logical decoding activation