Quick Links

Re: Improve pg_stat_statements scalability

From:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To:	samimseih(at)gmail(dot)com
Cc:	lukas(at)fittl(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Improve pg_stat_statements scalability
Date:	2026-06-01 04:58:58
Message-ID:	20260601.135858.1116584574478485492.horikyota.ntt@gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello.

I only reviewed v3-0001 so far, and the comments below are limited to
that patch. Some of them are higher-level observations rather than
comments on individual code fragments.

+Datum
+pg_stat_report_anytime(PG_FUNCTION_ARGS)

The name pg_stat_report_anytime() feels a bit confusing to me.

While "report" makes sense from an implementation perspective because
the function ultimately reports local statistics to the shared
statistics subsystem, that meaning is not very obvious from the
perspective of a SQL-visible function.

Likewise, I'm not sure that "report_anytime" clearly conveys what the
function actually does. Although this is not intended for general
users, we already have pg_stat_force_next_flush(), and it seems
preferable to follow a similar naming convention.

With that in mind, perhaps something like
pg_stat_flush_anytime_stats() would be more descriptive.

+void
+pgstat_report_anytime_stat(void)

In the previous thread, I understood that there was discussion about
introducing a flush interval of around one second, and that the
direction later shifted toward reusing PGSTAT_MIN_INTERVAL.

While the introduction of a manually triggered API changes when
flushes are initiated, it does not seem to change the need for
limiting the flush frequency itself.

As far as I can see, the current implementation does not apply any
throttling to calls of this API. In theory, a large number of backends
could invoke it frequently and generate a high flush rate. For
example, if 1000 backends were to call it once per second, that would
result in 1000 shared-stats updates per second.

Whether such a usage pattern would occur in practice is a separate
question. However, given that existing pgstat code uses
PGSTAT_MIN_INTERVAL to limit flush frequency, it seems to me that
anytime stats should probably retain a similar restriction.

- Assert(did_flush || nowait);
+ /*
+ * When nowait is false we block for the lock, so the only reason a
+ * flush_pending_cb can legitimately return false is that the entry
+ * has active transaction state that must not be freed yet (e.g.
+ * relation stats with trans != NULL). That situation only arises
+ * mid-transaction, hence the IsTransactionOrTransactionBlock() check.
+ */
+ Assert(did_flush || nowait || IsTransactionOrTransactionBlock());

Given the current design, the assertion itself looks reasonable to me,
since we're now allowing flush_pending_cb() to return false during
mid-transaction flushes. However, I find the comment a bit difficult
to reconcile with the actual condition. The comment explains a
specific reason why flush_pending_cb() may legitimately return false,
while the assertion itself only distinguishes between transaction and
non-transaction contexts.

As I understand it, the intent here is to keep the existing
requirement outside a transaction while relaxing it for
mid-transaction flushes. If so, I wonder whether expressing the
assertion as

Assert(IsTransactionOrTransactionBlock() ||
(did_flush || nowait));

would make that intent a bit more obvious.

If that's indeed the intent, perhaps the comment could also be phrased
in terms of "the assertion is relaxed during a transaction" rather
than describing a specific reason why flush_pending_cb() may return
false.

* Ignore entries that didn't accumulate any actual counts, such as
- * indexes that were opened by the planner but not used.
+ * indexes that were opened by the planner but not used. The entry cannot
+ * be freed if there is active transaction state, since
+ * AtEOXact_PgStat_Relations will still merge counters into it.
*/
if (pg_memory_is_all_zeros(&lstats->counts,
sizeof(struct PgStat_TableCounts)))
- return true;
+ return (lstats->trans == NULL);

I may be missing something, but I think there may be a more
fundamental issue behind this change.

Historically, flushing and freeing an entry were effectively the same
decision because flushing only happened after transaction end. With
anytime flushes, however, those become separate concerns. A callback
may be able to flush everything that is appropriate for the current
flush context, while the caller may still need to keep the entry
around for later transaction-end processing.

That makes it hard to reason about checks such as
pg_memory_is_all_zeros(&lstats->counts, ...). This check still appears
to tell whether the counters themselves are zero, but the proposed
change makes it look as if that is no longer enough to determine whether
the entry is "empty", because entry lifetime is also folded into the
callback result.

I wonder whether it would be cleaner for the caller to make the lifetime
decision, based on the kind of flush being performed, and for the
callback to be told that flush context explicitly. For example, the
callback could be passed whether this is an anytime flush or a
transaction-boundary flush, flush the counters that are appropriate for
that context, and then return whether any counters remain.

That way, the callback would not need to infer entry lifetime from
transaction state, and the return value could keep a simpler meaning.

Regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Re: Improve pg_stat_statements scalability at 2026-05-30 02:15:31 from Sami Imseih

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2026-06-01 05:17:16	Re: [PATCH] Remove obsolete tupDesc assignment in extended statistics
Previous Message	Michael Paquier	2026-06-01 04:37:34	Re: Add wait events for server logging destination writes