Quick Links

Re: Improve pg_stat_statements scalability

From:	Sami Imseih <samimseih(at)gmail(dot)com>
To:	Michael Paquier <michael(at)paquier(dot)xyz>
Cc:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, lukas(at)fittl(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Improve pg_stat_statements scalability
Date:	2026-06-26 22:33:28
Message-ID:	CAA5RZ0uyiH5Ke-AFqSn9R-g4wAQq7b4TbOBtNWKNZrfz+XqoRw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Sorry for the long silence, but I have been spending time on
this and would like to share my updates.

> > On Wed, Jun 03, 2026 at 05:10:39PM +0900, Kyotaro Horiguchi wrote:
> > > One alternative would be to introduce a separate callback for anytime
> > > flushes. However, since transaction-boundary flushes ultimately need
> > > to flush everything anyway, it seems to me that passing a context flag
> > > would be sufficient.
> >
> > Nah. I am not really in favor of an extra callback. I feel that this
> > could lead to more mistakes when implementing new stats kinds, so
> > doing your approach of using a value based on the context of the flush
> > works for me.

> I will proceed with this approach.

After further thought, I don't think passing a context flag to the
callback is necessary here.

0001 makes pg_stat_force_next_flush() a single entry point for force
flushing. If called inside a transaction, it flushes whatever can
safely be flushed at that point.

The only restriction is in pgstat_relation_flush_cb(), where transactional
counters cannot be flushed while there is active transaction state
for that entry (lstats->trans != NULL). That state is already directly
available to the callback, so I don't see what value passing a context
that most callbacks will not have a need for will add?

> > I like this better, especially for kinds with unique access
> > patterns like pg_stat_statements, where the view function and
> > eviction need to iterate over all their own entries frequently.
> > A dedicated hash eliminates the full-scan problem without a
> > secondary coordination dshash.

> I'm finding this alternative quite appealing, indeed. For PGSS, I was
> wondering about how it would be possible to avoid a secondary hash,
> and I did not consider an approach where we could bypass the main one.

0002 does this. own_hash is the name I chose for the option, but maybe
there is something better. This can only be defined for variable-numbered
stats. There is an API used by core and available to extensions,
pgstat_get_hash_for_kind, so they can retrieve the hash which is
needed for passing to dshash_* related calls.

Also, I realized that we need an API to drop a stats entry during an
in-progress dshash scan. The existing APIs (pgstat_drop_entry()
and pgstat_drop_matching_entries()) cannot be used for this because
they initiate their own dshash lookup or scan internally, which
conflicts with a scan the caller already has open. pgstat_drop_current()
operates directly on the entry the caller already holds from dshash_seq_next().

Also, pgstat_has_pending() lets the caller check whether an entry
has unflushed local data without doing a dshash lookup, which would
also conflict with the open scan.

This will be needed to implement a pg_stat_statements_reset() because
we must be able to scan through the kind's dshash and delete directly
and without undoing a pending change. But I also think there are
other places where this will be needed depending on the final
design of eviction. Attached 0003 does this.

There is also a piece of infrastructure missing that I have not yet
included. Currently, setting pg_stat_statements.save to false prevents
auxiliary data (query text, etc.) from being saved to disk, but does
not prevent the stat entries themselves from being written to the core
stats file, since .write_to_file is set at kind registration time and
cannot change.

I think we need a way for a kind to override .write_to_file at
runtime. PgStat_KindInfo must be a defined as a const.
One option would be a callback to override .write_to_file, which
an extension can then perhaps use to make more granular decisions
about which stats to serialize to the core stats file. Another
idea is an override options API, but I feel better about the
callback.

Now, as far as eviction goes. I have spent time with the help of
my benchmarking suite [1] looking into various strategies. My main
finding is that performing parallel evictions as discussed and
proposed in v3 will not work. In the high churn case, we will
not be able to put enough back pressure to keep the growth
of the dshash in check.

So, the idea that I have now is what I am attaching in 0004.

The important design points are:

1. We need to hard-cap growth of the dshash, so we cannot avoid
skipping new entries if we hit the cap. Right now the hard-cap
is pgss_max * 2.

2. We need to give new entries time to live before they become
eviction candidates. This helps in low-churn-but-some-churn cases.

3. Least-recently-used will be the eviction strategy, with a
threshold to keep new entries alive long enough to build up
their popularity before becoming candidates.

0004 is the work-in-progress idea for now. It uses a LWLock to
serialize eviction and new entries, but it also does eviction
throttling. The numbers I have on the benchmarks I have done
show really good numbers/scalability and I will be running
more benchmarks across all the workloads I have in [1] and
will share the results.

I also have some code on my machine that implements something
inspired by Redis LFU [2], in which instead of a single process
performing an eviction, the processes are constantly sampling
and updating eviction threshold and evictions are done in-line.
This requires a small ring buffer which we track new entries
( they are also in dshash ) and if they do not gain enough
popularity, they become candidates from removal from the dshash and
room is made for the next new entry. Not exactly the Redis
implementation, but something like that that can work for
pg_stat_statements. The bonus points for this is it can be done
without a global lock as evictions are in line. I will also include
this in my benchmarks.

I think the infrastructure patches are going to be needed regardless
of which direction to take, so I would like to see those get
committed while we work out the eviction details.

[1] https://github.com/samimseih/pg_stat_statements_benchmark
[2] https://github.com/redis/redis/blob/unstable/src/evict.c

--
Sami Imseih
Amazon Web Services (AWS)

Attachment	Content-Type	Size
v4-0001-pgstat-Allow-pg_stat_force_next_flush-to-work-in-.patch	application/octet-stream	15.4 KB
v4-0002-pgstat-allow-a-stats-kind-to-use-its-own-dedicate.patch	application/octet-stream	36.3 KB
v4-0003-pgstat-add-pgstat_drop_current-for-use-in-dshash-.patch	application/octet-stream	3.5 KB
v4-0004-WIP-pg_stat_statements-modernize-entry-storage-wi.patch	application/octet-stream	130.4 KB

In response to

Re: Improve pg_stat_statements scalability at 2026-06-24 04:17:55 from Michael Paquier

Responses

Re: Improve pg_stat_statements scalability at 2026-06-26 23:19:56 from Sami Imseih

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Sami Imseih	2026-06-26 23:19:56	Re: Improve pg_stat_statements scalability
Previous Message	Bharath Rupireddy	2026-06-26 22:18:25	Re: enhance wraparound warnings