From: | Naga Appani <nagnrik(at)gmail(dot)com> |
---|---|
To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
Cc: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Kirill Reshke <reshkekirill(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [Proposal] Expose internal MultiXact member count function for efficient monitoring |
Date: | 2025-09-03 21:10:56 |
Message-ID: | CA+QeY+DLsjj4RxY-Z2wRC-sYCeFxuPSd8zSkt4Nvs==YV6GP6w@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Atsushi and Ashutosh,
Thank you for reviewing the patch. Attached is v7, incorporating the
feedback. Please see my responses in-line below.
On Fri, Aug 22, 2025 at 6:45 AM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
>
> On Fri, Aug 22, 2025 at 7:37 AM torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> wrote:
> >
> > | Now that pg_get_multixact_stats() exposes num_members, the HEAD branch
> > | docs can describe the thresholds in terms of counts directly.
> >
> > Personally, I think it might be fine to keep the gigabyte-based
> > description, and perhaps we could show both the number of members and
> > gigabytes, since it'd be also helpful to have a sense of the scale.
> >
>
> Those who have grown their own utilities to monitor the on-disk usage
> will not be able to use the count based thresholds and might take some
> time for them to starting using pg_get_multixact_stats(). It makes
> sense to mention both the count and the corresponding disk usage
> threshold. Something like "Also, if the number of multixact members
> exceeds approximately 2^31 entries (occupying roughly more than 10GB
> in storage) ... ". Users can choose which threshold they want to use.
> Adding disk storage threshold in parenthesis indicates that the count
> is more accurate and more useful.
Updated docs to include both counts and approximate storage.
> I thought mentioning bytes, a unit, in column name members_bytes would
> not be appropriate in case we start reporting it in a different unit
> like kB in future. But we already have
> pg_stat_replication_slots::spill_bytes with similar naming. So may be
> it's okay. But I would prefer members_size or members_storage or some
> such units-free name.
Good point! Adjusted to a units-free name: members_size.
> + in <literal>pg_multixact/members</literal> directory,
> + <literal>oldest_multixact</literal> is the oldest multixact ID still
> in use, and
> + <literal>oldest_offset</literal> is the oldest member offset still in use.
>
> I am not sure whether oldest_offset is worth exposing. It is an
> implementation detail. Upthread, Michael suggested to expose oldest
> offset from GetMultiXactInfo(), but I don't see him explicitly saying
> that we should expose it through this function as well. Michael what
> do you think?
IMHO, exposing oldest_offset gives a full picture of multixact state.
It complements oldest_multixact,
and including it won’t hurt. That said, if consensus is against it,
I’m happy to drop it.
> + These values can be used to monitor multixact consumption and anticipate
> + autovacuum behavior. See <xref linkend="vacuum-for-multixact-wraparound"/>
> + for further details on multixact wraparound.
>
> I still think that this is not needed. There is no reason to restrict
> how users want to use this function. We usually don't do that unless
> there is a hazard associated with it.
Removed.
> + <para>
> + This is a live snapshot of shared counters; the numbers can change
> between calls,
> + even within the same transaction.
> + </para></entry>
>
> I have not seen the phrase "live snapshot" being used in the
> documentation before. How about "The function reports the statistics
> at the time of invoking the function. They may vary between calls even
> within the same transaction."?
Updated wording.
> + linkend="guc-autovacuum-multixact-freeze-max-age"/>. Also, if the number
> + of members created exceeds approximately 2^31 entries, aggressive vacuum
>
> a member means the transaction participating in a multixact. What you
> intend to say is "if the number of multixacts member entries created
> ...", right?
Correct, updated.
> + <para>
> + The <function>pg_get_multixact_stats()</function> function, described in
>
> unnecessary pair of commas.
Fixed.
> + This output shows a system with significant multixact activity:
> about ~100 million
> + multixact IDs and ~773 million member entries have been created
> since the oldest
> + surviving multixact (ID 39974368). By leveraging this information,
> the function helps:
> + <orderedlist>
> + <listitem>
> ... snip ...
> + Detect potential performance impacts before they become critical.
> + For instance, high multixact usage from frequent row-level locking or
> + foreign key operations can lead to increased I/O and CPU overhead during
> + vacuum operations. Monitoring these stats helps tune autovacuum frequency
> + and transaction patterns.
> + </simpara>
> + </listitem>
> + </orderedlist>
>
> I am unsure whether we should be mentioning use cases in such detail.
> Users may find other ways to use those counts. I think the following
> paragraph should be placed here.
>
> + These values can be used to monitor multixact consumption and anticipate
> + autovacuum behavior. See <xref linkend="vacuum-for-multixact-wraparound"/>
> + for further details on multixact wraparound.
>
> But others may have different opinions.
>
> Maybe you could further write in your example that an aggressive
> autovacuum will be triggered in another 10 seconds (or so) if the
> number of member entries continues to double every 5 seconds. Or some
> practical "usage example" like that.
Ack. I believe keeping the example with a short list is helpful for
users to navigate
and interpret the stats. If preferred, I can trim it to a brief
paragraph with just the query in
the next rev.
> + * Returns statistics about current MultiXact usage:
> + * - num_mxids: Number of MultiXact IDs in use
> + * - num_members: Total number of member entries
> + * - oldest_multixact: Oldest MultiXact ID still needed
> + * - oldest_offset: Oldest offset still in use
>
> We don't need to mention each column here, it's evident from the
> function body and also from the user facing documentation. Just the
> first line is ok.
Updated - kept only the high-level description.
> + *
> + * Returns a row of NULLs if the MultiXact system is not yet initialized.
>
> tuple or record instead of row.
>
> In the earlier patch you were calling PG_RETURN_NULL(), which I
> thought was better. It would get converted into a record of NULLs if
> someone is to do SELECT * FROM pg_get_multixact_stats().
>
> I don't think "the MultiXact system is not yet initialized" is the
> right description of that condition. GetMultiXactInfo() prologue says
> "
> Returns false if unable to determine, the oldest offset being
> unknown." MultiXactStatData has following comment for oldest offset.
> /*
> * Oldest multixact offset that is potentially referenced by a multixact
> * referenced by a relation. We don't always know this value, so there's
> * a flag here to indicate whether or not we currently do.
> */
Switched to PG_RETURN_NULL() and rephrased both code comment and docs.
> + * Calculate approximate storage space:
> + * - Members are stored in groups of 4
> + * - Each group takes 20 bytes (5 bytes per member)
> + * Note: This ignores small page overhead (12 bytes per 8KB)
> + */
> + membersBytes = (int64) members * 5;
>
> Do we have some constant macros or sizeof(some structure) defined for
> 5 and 4? That way this computation will be self maintaining and self
> documenting.
Those macros are already defined in multixact.c - for example,
MULTIXACT_MEMBERS_PER_MEMBERGROUP
and MULTIXACT_MEMBERGROUP_SIZE encode the 4-per-group and 20-byte
layout. They are local today,
and I’m not sure why they were never exposed. Rather than moving them
into a header and creating wider changes,
v7 retains the explicit 5-bytes/member estimate with an explanatory
comment to stay consistent with existing guidance.
If we feel these macros should be promoted to a header, I think that
would be best handled as a small, separate patch,
and I’d be happy to help with that.
> + nulls[0] = nulls[1] = nulls[2] = nulls[3] = nulls[4] = false;
>
> memset(nulls, false, sizeof(nulls)); is better and used everywhere.
>
> In fact, instead of initializing it all to true first and then setting
> all to false here, we should memset here and set it to true in else
> block.
Updated. v7 uses memset(false) and only sets true where needed.
> +++ b/src/test/isolation/specs/multixact_stats.spec
>
> I have not an seen an isolation test being used for testing a stats
> function. But I find it useful. Let's see what others think.
>
> @@ -0,0 +1,127 @@
> +# High-signal invariants for pg_get_multixact_stats()
> +# We create exactly one fresh MultiXact on a brand-new table. While
> it is pinned
> +# by two open transactions, we assert only invariants that background
> VACUUM/FREEZE
> +# cannot violate:
> +# • members increased by ≥ 1 when the second locker arrived,
> +# • num_mxids / num_members did not decrease vs earlier snapshots,
> +# • oldest_* never decreases.
> +# We make NO assertions after releasing locks (freezing/truncation
> may shrink deltas).
> +# NOTE: Snapshots snap0 and subsequent checks are taken inside an open driver
> +# transaction to narrow the window for unrelated truncation between snapshots.
>
> What's a driver transaction?
A driver transaction is simply the controlling session that stays open
while snapshots are taken.
> +#
> +# Terminology (global counters):
> +# num_mxids, num_members : “in-use” deltas derived from global horizons
> +# oldest_multixact, offset : oldest horizons; they move forward, never backward
> +#
> +# All assertions execute while our multixact is pinned by open txns,
> which protects
> +# the truncation horizon (VACUUM can’t advance past our pinned multi).
>
> Probably this comment is not needed. But from the sequence of steps
> executed, the data is collected when multixact is pinned (what does
> that mean?) but the assertions are executed at the end when all the
> transactions are committed. Am I correct?
You are correct — the assertions are executed at the end, after the commits.
The key point is that all snapshots are taken while the multixact is
pinned by open transactions,
so the invariants hold despite the final check happening later.
> +step snap0 {
> + CREATE TEMP TABLE snap0 AS
> + SELECT num_mxids, num_members, oldest_multixact, oldest_offset
> + FROM pg_get_multixact_stats();
> +}
>
> You could use a single table with a primary key column to distinguish
> snaps which can be used for joining the rows. Why use a temporary
> table? Just setup and tear down the snap table as well?
I kept separate temp tables to keep each snapshot isolated and easy to
read in the spec.
A single table with a PK would work too, but I felt temp tables made
the sequence clearer.
> +
> +# Pretty, deterministic key/value output of boolean checks.
> +# Keys:
> ... snip ...
> + (s1.num_mxids >= COALESCE(s0.num_mxids, 0)),
> + (s2.num_mxids >= COALESCE(s1.num_mxids, 0)),
> + (s1.num_members >= COALESCE(s0.num_members, 0)),
> + (s2.num_members >= COALESCE(s1.num_members, 0))
> + ]
>
> This is getting too complex to follow. It produces pretty output but
> the query is complex. Instead just let keys as the columns in the
> query. Maybe you could print expanded output if that's possible in an
> isolation test.
>
I used the labeled key/value array to mimic \x-style readability while
keeping the output
deterministic for isolation’s text diffs. It clearly names each
invariant and avoids
formatter-dependent width/spacing.
Thanks again for the thoughtful reviews. I really appreciate the
guidance and will be glad to adjust
further if needed.
Best regards,
Naga
Attachment | Content-Type | Size |
---|---|---|
v7-0001-Add-pg_get_multixact_stats-function-for-monitorin.patch | application/octet-stream | 18.9 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Matheus Alcantara | 2025-09-03 21:14:40 | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |
Previous Message | Matheus Alcantara | 2025-09-03 21:04:47 | Re: LISTEN/NOTIFY bug: VACUUM sets frozenxid past a xid in async queue |