Re: pg_stat_io_histogram

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_stat_io_histogram
Date: 2026-03-17 12:13:59
Message-ID: CAKZiRmy8qMeFprFqm0cyQQXtBRbP2_3TVNwxYOeedD-Yro-P=Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 16, 2026 at 11:50 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:

Hi Tomas,

thanks for reviewing! New version attached: it is rebased and addresses
some of the review concerns. I've left the memory-use-reduction ideas
unsquashed to make them easier to reference.

> I've looked at this patch today, as it's tangentially related to the
> patch adding prefetch stats to EXPLAIN. An earlier WIP version of the
> EXPLAIN patch included a couple of histograms (not for timings, but for IO
> sizes, prefetch distances, etc.). And during development it was quite
> useful. We decided to keep EXPLAIN simpler, but having that at least in
> a pg_stat_io_histogram view would be nice. So +1 to this.
>
> I went through the thread today. A lot has been discussed, most issues
> seem to have been solved, but what are the outstanding ones? I think it
> would be helpful if someone involved (=Jakub) could write a brief
> summary, listing the various open questions.

I think the feature is fine as it is, but there are three potentially
open topics:

1. Concerns about memory use. With v7 I had a couple of ideas that really
minimize the memory use while keeping the code simple (nothing fancy, just
trimming structures and allocating memory dynamically). I hope those reduce
the memory footprint to acceptable levels; see my earlier description
for v7.

2. There were concerns about time conversion overhead. I think the
benchmarks so far show we do not have any problem there.

3. The remaining open question is bucket width (i.e. how much we want this
to be a tool for spotting outliers vs. a tool for I/O performance
analysis).

> Now let me add a couple open questions of my own ;-)
>
> My understanding is that the performance (i.e. in terms of CPU) is fine.
> I'm running some tests on my own, both to check the behavior and to
> learn about it. And so far the results show no change in performance,
> it's all within 1% of master. So that's fine.

Right, so far multiple tests have shown that the CPU impact is negligible,
thanks to using that simple multi-dimensional array...
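For reference, the hot path boils down to something like the sketch below.
This is my own illustration, not the patch itself: the names are invented,
and the dimensions just mirror the [3][5][8][buckets] shape mentioned
later in this mail. First bucket [0,8us), doubling widths, open-ended top
bucket.

```c
#include <stdint.h>

/* Hypothetical dimensions mirroring the [3][5][8][nbuckets] shape
 * discussed in this thread (io objects, io contexts, io ops, latency
 * buckets); not taken from the actual patch. */
#define IO_OBJECTS     3
#define IO_CONTEXTS    5
#define IO_OPS         8
#define HIST_BUCKETS   16

typedef struct IOHist
{
    uint64_t counts[IO_OBJECTS][IO_CONTEXTS][IO_OPS][HIST_BUCKETS];
} IOHist;

/*
 * Map elapsed microseconds to a bucket: [0,8us) is bucket 0, widths
 * double after that, and the last bucket is open-ended.  The real code
 * would use pg_leftmost_one_pos64() instead of a loop, but the mapping
 * is the same.
 */
static int
hist_bucket(uint64_t usecs)
{
    int     b = 0;

    usecs /= 8;                 /* 8us floor */
    while (usecs > 0 && b < HIST_BUCKETS - 1)
    {
        usecs >>= 1;
        b++;
    }
    return b;
}

/* Bumping a counter is a single array increment -- no branching on
 * configuration, no allocation -- which is why the CPU overhead stays
 * in the noise. */
static inline void
hist_record(IOHist *h, int obj, int ctx, int op, uint64_t usecs)
{
    h->counts[obj][ctx][op][hist_bucket(usecs)]++;
}
```

With 16 buckets this puts a 7us NVMe read in bucket 0 and anything from
~131ms up in the open-ended bucket 15.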

> memory usage
> ------------
> AFAICS the primary outstanding issue seems to be how to represent the
> histograms in each backend, so that it's not wasteful but also not
> overly complex. I'm not sure what's the current situation, and how far
> from acceptable it is.

Correct, the crux of the issue is whether the array used to store the
histograms takes too much memory. We would probably need to hear from
Andres whether that is acceptable or not.

More memory-savvy data structures could probably be:
- a tile-based (chunk-allocated) array
- a hash structure
However, both would potentially cost more CPU, add complexity, and (in a
primitive implementation) would likely end up doing dynamic memory
allocation in critical sections (where it would blow up), so that would
have to be worked around by pre-allocating based on, probably, backend
type and the I/O expected there. On the negative side, there's also the
concern that this would push the feature to PG20.

> histogram range
> ---------------
> Another questions is what should be the range of the histogram, and
> whether the buckets should be uint32 or uint64. It's somewhat related to
> the previous question, because smaller histograms need less memory
> (obviously). I think the simpler the better, which means fixed-size
> histograms, with a fixed number of buckets of equal size (e.g. uint64).

+1

> But that implies limited range / precision, so I think we need to decide
> whether we prefer accurate buckets for low or high latencies.
>
> The motivation for adding the histograms was investigating performance
> issues with storage, which involves high latencies. So that would prefer
> better tracking of higher latencies (and accepting lower resolution for
> buckets close to 0). In v7 the first bucket is [0,8) microsecs, which to
> me seems unnecessarily detailed. What if we started with 64us? That'd
> get us to ~1s in the last bucket, and I'd imagine that's enough. We
> could use the last bucket as "latencies above 1s". If you have a lot of
> latencies beyond 1s, you have serious problems.
>
> Yes, you can get 10us with NVMe, and so if everything works OK
> everything will fall into the first bucket. So what? I think it's a
> reasonable trade off. We have to compromise somewhere.

A minor use case I have tried to cover is answering whether I/O hit the
buffer cache or the device itself, but I'm happy to adjust it again if
there's consensus.

Earlier in this thread Andres and I had the following exchange:

> > (..) The current implementation uses fast bucket
> > calculation to avoid overheads and tries to cover most useful range of
> > devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and
> > abnormally high latency too as from time to time I'm try to help with I/O
> > stuck for *seconds*, usually a sign of some I/O multipath issues, device
> > resetting, or hypervisor woes).

> Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
> NVMe disk you'll get < 10us, after all.

so I've interpreted 8us as the sweet spot, and as v8 stands the max is
128ms (which to me is already a strong indicator of stuff being broken
anyway; Ants also wanted to have it much higher):

> I think it would be useful to have a max higher than 131ms. I've seen
> some cases with buggy multipathing driver and self-DDOS'ing networking
> hardware where the problem latencies have been in the 20s - 60s range.
> Being able to attribute the whole time to I/O allows quickly ruling
> out other problems. Seeing a count in 131ms+ bucket is a strong hint,
> seeing a count in 34s-68s bucket is a smoking gun.

but raising the max would again push the memory consumption even higher.
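To make the trade-off concrete, here is my arithmetic under the
doubling-bucket scheme described above (8us first bucket, widths doubling,
open-ended top bucket); the function name is mine:

```c
#include <stdint.h>

/*
 * With a [0,8us) first bucket and doubling widths, bucket k covers
 * [8*2^(k-1), 8*2^k) us, so the open-ended top bucket of an N-bucket
 * histogram starts at 8 * 2^(N-2) us.
 */
static uint64_t
top_bucket_start_usecs(int nbuckets)
{
    return (uint64_t) 8 << (nbuckets - 2);
}
```

So 16 buckets top out at 131072us ~ 131ms, while 24 buckets top out at
33554432us ~ 34s, i.e. exactly the range where the 20s-60s multipath
horror stories would land.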

> Alternatively, we could make the histograms more complex. We could learn
> a thing or two from ddsketch, for example - it can dynamically change
> the range of the histogram, depending on input.
>
> We could also make the buckets variable-sized. The buckets have
> different widths, and assuming uniform distribution will get different
> number of matches - with bucket N getting ~1/2 of bucket (N+1). So it
> could be represented by one fewer bit. But it adds complexity, and IO
> latencies are unlikely to be uniformly distributed.
>
> Alternatively we could use uint32 buckets, as proposed by Andres:
>
> > I guess we could count IO as 4 byte integers, and shift all bucket
> > counts down in the rare case of an on overflow. It's just a 2x
> > improvement, but ...
>
> That'd mean we start sampling the latencies, and only add 50% of them to
> the histogram. And we may need to do that repeatedly, cutting the sample
> rate in 1/2 every time. Which is probably fine for the purpose of this
> view, but it adds complexity, and it means you have to "undo" this when
> displaying the data. Otherwise it'd be impossible to combine or compare
> histograms.
>
> Anyway, what I'm trying to say is that we should keep the histograms
> simple, at least for now.

Yes, my main goal was to prioritize simplicity and near-zero CPU impact.
The secondary trade-off, in my opinion, is some memory use, but only for
those who enable it via GUCs. Anyway, if I change PGSTAT_IO_HIST_BUCKETS
from 16 to 24 (so from 16*8 = 128 B [2 cache lines] to 24*8 = 192 B
[3 cache lines]), I get:
- latency resolution of 8us up to 32s (instead of max 128ms)
- shm 'Shared Memory Stats' increases from 482944 to 575104 bytes
(with track* GUCs)
- no impact for backends not using track*io_timings
- if track* GUCs are set then uint64_t hist_time_buckets becomes [3][5][8][24],
so growth from ~15kB to ~23kB
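The per-backend growth above checks out, assuming plain uint64 counters
(the helper name is mine, for illustration only):

```c
#include <stddef.h>
#include <stdint.h>

/* Back-of-the-envelope check of the growth quoted above, assuming a
 * plain uint64 counter array shaped [3][5][8][nbuckets]. */
static size_t
hist_bytes(int nbuckets)
{
    return (size_t) 3 * 5 * 8 * nbuckets * sizeof(uint64_t);
}
```

hist_bytes(16) = 15360 B (~15kB) and hist_bytes(24) = 23040 B (~23kB),
matching the ~15kB to ~23kB growth quoted above.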

> other histograms
> ----------------
> As mentioned, the EXPLAIN PoC patch had I/O histograms for I/O sizes,
> in-progress I/Os, etc. I wonder if maybe it'd make sense to have
> something like that in pg_stat_io_histogram too. Ofc, that goes against
> the effort to reduce the memory usage etc.

Hm, you got me thinking: maybe we should rename this to
pg_stat_io_lat_histogram, so as to leave room (in the future) for others
like pg_stat_io_size_histogram, etc.

> a couple minor review comments
> ------------------------------
>
> 1) There seems to be a bug in calculating the buckets in the SRF:
[..]
> Notice the upper boundary includes the lower boundary of the next
> bucket. It should be [0,8), [8,...). pg_stat_io_histogram_build_tuples
> should probably set "upper.inclusive = false;".

Right, I was blind, fixed.

> 2) This change in monitoring.sgml seems wrong:
>
> <structname>pg_stat_io_histogram</structname> set of views ...
>
> AFAICS it should still say "pg_stat_io set of views", but maybe it
> should mention the pg_stat_io_histogram too.

Fixed; that was misplaced, as this view has nothing to do with the buffer
cache hit ratio.

> 3) pg_leftmost_one_pos64 does this:
>
> #if SIZEOF_LONG == 8
> return 63 - __builtin_clzl(word);
> #elif SIZEOF_LONG_LONG == 8
> return 63 - __builtin_clzll(word);
> #else
>
> Shouldn't pg_leading_zero_bits64 do the same thing?

Windows, cough, should be fixed.

> 4) I'm not sure about this pattern:
>
> PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
> if(bktype == -1)
> continue;
>
> Maybe it won't trigger an out-of-bounds read, if the compiler is smart
> enough to delay the access to when the pointer is really needed. But it
> seems confusing / wrong, and I don't think we do this elsewhere. For
> example the functions in pgstat_io.c do this:
>
> PgStat_BktypeIO *bktype_shstats;
> if (bktype == -1)
> continue;
> bktype_shstats = &pgStatLocal.shmem->io.stats.stats[bktype];

Yes, that was temporary, and I was pretty sure I had eradicated all code
like this earlier; I did remove it from pgstat_io.c (fun fact: AFAIK it
sometimes failed tests at -O0 with ubsan/asan, but not at -O2). I missed
this one and have now fixed it in src/backend/utils/adt/pgstatfuncs.c
too, thanks for spotting it.

-J.

Attachment Content-Type Size
v8-0002-PendingBackendStats-save-memory.patch text/x-patch 2.4 KB
v8-0005-Condense-PgStat_IO.stats-BACKEND_NUM_TYPES-array-.patch text/x-patch 7.9 KB
v8-0001-Add-pg_stat_io_histogram-view-to-provide-more-det.patch text/x-patch 34.4 KB
v8-0004-Convert-PgStat_IO-to-pointer-to-avoid-huge-static.patch text/x-patch 3.5 KB
v8-0003-PendingIOStats-save-memory.patch text/x-patch 9.0 KB
v8-0006-Further-condense-and-reduce-memory-used-by-pgstat.patch text/x-patch 5.2 KB
