Re: pg_stat_io_histogram

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_stat_io_histogram
Date: 2026-02-24 14:04:18
Message-ID: CAKZiRmzpBAv7JucezeZgg5cbpprx_=K6XWtQ=NLWJVKMqS_d0w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 23, 2026 at 1:35 PM Jakub Wartak
<jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
>
> On Thu, Feb 19, 2026 at 7:12 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Hi,
> >
> > On 2026-02-19 19:55:06 +0200, Ants Aasma wrote:
> > > > Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third for
> > > > 16-32. I.e. the first bucket is the same width as the second. Is that
> > > > intentional?
> > >
> > > If the boundaries are not on powers of two, calculating the correct bucket
> > > would take a bit longer.
> >
> > Powers of two make sense, my point was that the lowest bucket and the next
> > smallest one are *not* sized in a powers of two fashion, unless I miss
> > something?
>
> Yes, as stated earlier it's intentionally made flat at the beginning to be able
> to differentiate those fast accesses.
>
> > > For reducing the number of buckets one option is to use log base-4 buckets
> > > instead of base-2.
> >
> > Yea, that could make sense, although it'd be somewhat sad to lose that much
> > precision.
>
> Same here; as stated earlier, I wouldn't like to lose this precision.
>
> > > But if we are worried about the size, then reducing the number of histograms
> > > kept would be better.
> >
> > I think we may want both.
>
> +1.
>
> > > Many of the combinations are not used at all
>
> This!
>
> > Yea, and for many of the operations we will never measure time and thus will
> > never have anything to fill the histogram with.
> >
> > Perhaps we need to do something like have an array of histogram IDs and then a
> > smaller number of histograms without the same indexing. That implies more
> > indirection, but I think that may be acceptable - the overhead of reading a
> > page are high enough that it's probably fine, whereas a lot more indirection
> > for something like a buffer hit is a different story.
>
> OK, so the previous options from the thread are:
> a) use uint32 instead of uint64 and deal with overflows
> b) filter some combinations out in order to save some memory; trouble would be
> which ones to eliminate... and would e.g. a 2x saving be enough?
> c) leave it as it is (accept the simple/optimal code and waste this ~0.5MB
> in pgstat.stat)
> d) the indirection scheme above - but I hardly understood what it would look like
> e) eliminate some precision (via log4?) or a column (like context) - IMHO we
> would lose too much precision or sacrifice the original goals with this.
>
> So I'm kind of lost on how to progress this, because - as previously stated -
> I do not understand this challenge with memory saving and do not know the aim
> or where to stop this optimization, thus I'm mostly +1 for "c", unless somebody
> enlightens me, please ;)
>
> > > and for normal use being able to distinguish latency profiles between so
> > > many different categories is not that useful.
> >
> > I'm not that convinced by that though. It's pretty useful to separate out the
> > IO latency for something like vacuuming, COPY and normal use of a
> > relation. They will often have very different latency profiles.
>
> +1
>
> --
>
> Anyway, I'm attaching v6 - no serious changes, just cleaning:
>
> 1. Removed dead ifdefed code (finding most significant bits), as testing by Ants
> showed that CLZ has literally zero overhead.
> 2. Rebased and fixed a missing include of the ports/bits header for
> pg_leading_zero_bits64(); dunno why it didn't complain earlier.
> 3. Added Ants as reviewer.
> 4. Fixed one comment referring to the wrong function (nearby enum hist_io_stat_col).
> 5. Added one typedef to src/tools/pgindent/typedefs.list.
>
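As an aside on the bucket-boundary discussion above, a minimal sketch of the
flat-then-doubling layout computed via CLZ (the 8 ms cutoff and 16 buckets
follow the numbers quoted in this thread, but the patch's exact boundaries
may differ, and pg_leading_zero_bits64() is replaced here by the compiler
builtin it wraps):

```c
#include <stdint.h>

#define HIST_NBUCKETS 16

/*
 * Sketch: bucket 0 covers 0-8 ms, bucket 1 covers 8-16 ms (intentionally
 * the same width), and every later bucket doubles in width.
 */
static int
hist_bucket_for_us(uint64_t elapsed_us)
{
	uint64_t	ms = elapsed_us / 1000;
	int			bucket;

	if (ms < 8)
		return 0;				/* flat first bucket: 0-8 ms */

	/* 63 - clz(ms) == floor(log2(ms)); 8-16 ms -> 1, 16-32 ms -> 2, ... */
	bucket = (63 - __builtin_clzll(ms)) - 2;
	return bucket >= HIST_NBUCKETS ? HIST_NBUCKETS - 1 : bucket;
}
```

With a single branch for the flat bucket and one CLZ for the rest, this
matches the "literally zero overhead" observation above.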

I think I have found another way to minimize the weight of that memory
allocation: simply remap sparse backend type IDs to contiguous ones:

0. So the original patch weighs in as below, according to pahole:

struct PgStat_BktypeIO {
[..]
uint64 hist_time_buckets[3][5][8][16]; /* 2880 15360 */
/* size: 18240, cachelines: 285, members: 4 */
};

struct PgStat_IO {
[..]
PgStat_BktypeIO stats[18]; /* 8 328320 */
/* size: 328328, cachelines: 5131, members: 2 */
/* last cacheline: 8 bytes */
};

so 320kB total and not 0.5MB for a start.
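For the record, the arithmetic behind those pahole numbers checks out; a tiny
sketch (the dimension naming, 3 objects x 5 contexts x 8 ops x 16 buckets, is
my reading of the array, not taken from the patch):

```c
#include <stdint.h>

/* Reproducing the pahole arithmetic above. */
enum
{
	HIST_BYTES_PER_BKTYPE = 3 * 5 * 8 * 16 * sizeof(uint64_t),	/* 15360 */
	BKTYPE_STRUCT_BYTES = 18240,	/* histograms plus the other members */
	TOTAL_STATS_BYTES = BKTYPE_STRUCT_BYTES * 18	/* 328320, ~320 kB */
};
```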

1. I've noticed that we were already skipping 4 out of 17 (~23%) backend
types (thanks to pgstat_tracks_io_bktype()), and with a simple array
condensation of the backend type (attached dirty PoC) I can get this down to:

struct PgStat_IO {
[..]
PgStat_BktypeIO stats[14]; /* 8 255360 */
/* size: 255368, cachelines: 3991, members: 2 */
/* last cacheline: 8 bytes */
};

so the attached crude patch is mainly about remapping via
pgstat_remap_condensed_bktype(). The patch needs lots of work, but it
demonstrates the point.
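To make the remapping idea concrete, here is a heavily simplified sketch: the
backend-type enum is abridged and tracks_io() is a stand-in for
pgstat_tracks_io_bktype(); the real logic lives in the PoC's
pgstat_remap_condensed_bktype().

```c
#include <stdbool.h>

/* Abridged stand-in for PostgreSQL's BackendType enum. */
typedef enum BackendType
{
	B_INVALID,
	B_AUTOVAC_LAUNCHER,
	B_AUTOVAC_WORKER,
	B_BACKEND,
	B_CHECKPOINTER,
	B_LOGGER,
	B_BACKEND_NTYPES
} BackendType;

/* Stand-in for pgstat_tracks_io_bktype(): which types do tracked I/O? */
static bool
tracks_io(BackendType b)
{
	switch (b)
	{
		case B_AUTOVAC_WORKER:
		case B_BACKEND:
		case B_CHECKPOINTER:
			return true;
		default:
			return false;
	}
}

static int	condensed_id[B_BACKEND_NTYPES];
static int	n_condensed = 0;

/*
 * Build the dense mapping once: untracked backend types get -1, tracked
 * ones get contiguous IDs 0..n-1, so the per-backend-type stats array
 * needs only n_condensed slots instead of B_BACKEND_NTYPES.
 */
static void
build_condensed_map(void)
{
	for (int b = 0; b < B_BACKEND_NTYPES; b++)
		condensed_id[b] = tracks_io((BackendType) b) ? n_condensed++ : -1;
}
```

Lookups then become stats[condensed_id[MyBackendType]], with a -1 check
replacing today's pgstat_tracks_io_bktype() call at the accounting sites.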

2. We could reduce this slightly further, if necessary, by also ignoring
B_AUTOVAC_LAUNCHER and B_STANDALONE_BACKEND for pg_stat_io. I mean, those
seem not to generate any I/O, and yet pgstat_tracks_io_bktype() says
yes to them.

Thoughts? Is that a good direction? Would 1 or 2 be enough?

-J.

Attachment Content-Type Size
poc_apply_on_earlier_v6_reduce_memfootprint.txt text/plain 5.3 KB
