Re: pg_stat_io_histogram

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_stat_io_histogram
Date: 2026-01-30 13:43:41
Message-ID: CAKZiRmyLKeh9thmHNbkD7KSy3fsoUeopNVEGH33na8dXS9kN2g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 29, 2026 at 5:27 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2026-01-28 12:12:10 +0100, Jakub Wartak wrote:
> > On Tue, Jan 27, 2026 at 1:06 PM Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
> > > Not yet, I first wanted to hear if I'm not sailing into some plain stupid
> > > direction somewhere with this idea or implementation (e.g.
> > > that INSTR_TIME_GET_MICROSEC() was a really stupid omission from my side).
> > >
> > > I'll try to perform this test overhead measurement hopefully with v3 once
> > > we settle on how to do that bit shifting/clz().
> > >
> >
> > [..]
> > Here's the answer: on a properly isolated perf test run (my old & legacy &
> > predictable 4s32c64t NUMA box, s_b=8GB, DB size 16GB, hugepages, no
> > turboboost, proper warmup, no THP, cpupower D0, no physical I/O, ~22k
> > pread64() calls/sec combined to VFS cache), restricted to a single NUMA
> > node: numactl --membind=0 --cpunodebind=0,
> > measured using: pgbench -M prepared -c 4 -j 4 postgres -T 20 -P 1 -S
> >
> > master+track_io_timings=on, 60s warmup and then 3x runs
> > tps = 44615.603668
> > tps = 44556.191492
> > tps = 44813.793981
> > avg = 44662
> >
> > master+track_io_timings=on+patch, 60s warmup and then 3x runs
> > tps = 44441.879384
> > tps = 44403.101737
> > tps = 45036.747418
> > avg = 44627
> >
> > so that's like 99.921% (so literally no overhead) and yields a picture like:
>
> I don't think that's a particularly useful assurance, unfortunately:
>
> 1) Using pgbench with an in-memory readonly workload is typically limited by
> context switch overhead and per-statement overhead. After a short while you
> have at most one IO per statement (the heap page), which obviously isn't
> going to be affected by a small per-IO overhead.
>
> 2) The per-core memory bandwidth on that old machine, if it's the quite old
> EDB machine I think it is, is so low, that you'd be bottlenecked by memory
> bandwidth well before you're going to be bottlenecked by actual CPU stuff
> (which the bucket computation is).

Hi, thanks for having a look!

That legacy server is mine, but yeah, AFAIR it only gets ~16GB/s even within
a single NUMA node and just ~4GB/s between nodes. I forgot to reply in that
old NUMA thread back then - maybe it's not relevant, but I often find that
box valuable, as bottlenecks are easier to hit/notice there and it has few of
the traps that modern CPUs have (plus, finding 4 NUMA nodes per socket in 2U
isn't easy today; it's a physical box nearby, so PMCs are available too,
unlike in the "cloud", and with 4 nodes the imbalances between nodes/zones
are much more clearly visible than with just 1 local vs 1 remote). Somehow
I've built trust in results from that machine (but I can still lift it, so I
probably shouldn't trust it fully - a pun on "never trust a machine you can
lift" :D)

> I think you'd have to test something like pg_prewarm(), with
> io_combine_limit=1, on a modern *client* CPU (client CPUs typically have much
> higher per-core memory bandwidth than the more scalable server CPUs).

Fair point, thanks for the explanation above. So, onto a much more modern
Intel(R) Ultra 7 (1s16c32t, which according to mlc(1) can do up to 85-90GB/s
of bandwidth and has those ugly P/E-cores, so I pinned (taskset) the backend
doing pg_prewarm to a P-core that normally runs at 5GHz, but with no_turbo
it's just 1.4GHz). That was a normal build (not debug, so -O2, no casserts,
gcc 13, kernel 6.14.x, no_turbo, s_b=8GB with hugepages, DB scale 600
(pgbench_accounts @ 7.7GB), performance governor, no THP, ...)

I was kind of surprised, but here is a full disclosure of my results. With
the patch and track_io_timing=on, io_combine_limit=1, io_method='sync' (just
so I only have to care about one PID), each run is pg_buffercache_evict_all()
followed by a timed select pg_prewarm('pgbench_accounts');

Initially I got this picture (some worst/best outliers eliminated):

Without patch:
Time: 4644.346 ms (00:04.644)
Time: 4612.610 ms (00:04.613)
Time: 4639.133 ms (00:04.639)
Time: 4625.020 ms (00:04.625)
Time: 4636.652 ms (00:04.637)
Avg: 4631ms

With the patch:
Time: 4765.780 ms (00:04.766)
Time: 4784.308 ms (00:04.784)
Time: 4754.661 ms (00:04.755)
Time: 4770.772 ms (00:04.771)
Time: 4768.232 ms (00:04.768)
Avg: 4768ms (102.95%)

With the patch and __builtin_clzl():
Time: 4750.293 ms (00:04.750)
Time: 4729.288 ms (00:04.729)
Time: 4727.820 ms (00:04.728)
Time: 4729.760 ms (00:04.730)
Time: 4727.146 ms (00:04.727)
Avg: 4732ms (102.18%)

So clearly there was some overhead (I started getting worried), and
__builtin_clzl() was either slightly cheaper or it was just jitter - and yes,
I got plenty of jitter with this (not shown; the above are roughly 5 results
out of 15).
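
(Back-of-the-envelope, assuming plain 8kB block reads of the ~7.7GB table,
i.e. roughly a million preads per run: the ~137ms delta between the 4631ms
and 4768ms averages works out to something on the order of 0.1us of extra
cost per IO.)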

With the v2 patch, __builtin_clzl() and the default track_io_timing=off, it
got back to ~4660ms as expected.

With the v2 patch, __builtin_clzl(), the default io_combine_limit=128kB and
track_io_timing=off, it went back to ~4150ms:
Time: 4151.942 ms (00:04.152)
Time: 4133.747 ms (00:04.134)
Time: 4153.103 ms (00:04.153)
Time: 4135.199 ms (00:04.135)

With the patch, __builtin_clzl(), the default io_combine_limit=128kB and
track_io_timing=on, it was also at ~4150ms:
Time: 4152.941 ms (00:04.153)
Time: 4154.096 ms (00:04.154)
Time: 4155.119 ms (00:04.155)

So with "batching" the IOs, the overhead is almost gone. BTW that's with
current_clocksource says "tsc". After dozens of runs, I've noticed thermals
starting playing a bigger role than this patch, so i've did idle-set -D0
and it degraded even further.

Master, but still with lots of fluctuations - the unfiltered picture:
Time: 5518.546 ms (00:05.519)
Time: 5587.675 ms (00:05.588)
Time: 5512.828 ms (00:05.513)
Time: 5534.023 ms (00:05.534)
Time: 5728.125 ms (00:05.728)
Time: 5731.543 ms (00:05.732)
Time: 5762.687 ms (00:05.763)
Time: 5565.607 ms (00:05.566)
Time: 5498.496 ms (00:05.498)
Time: 5637.870 ms (00:05.638)

but if I leave it *idle* for a couple of minutes, then I get:
Time: 5577.879 ms (00:05.578)
Time: 5575.648 ms (00:05.576)
Time: 5548.146 ms (00:05.548)

After some break, with the patch and __builtin_clzl() (it sometimes gets
lower than master - how can I trust this?!):
Time: 5504.415 ms (00:05.504)
Time: 5531.827 ms (00:05.532)
Time: 5733.146 ms (00:05.733)
Time: 5511.549 ms (00:05.512)

So something more is happening there, probably thermals/scheduler rather than
the patch. Of course I've done some homework [1][2] - I even found your rant
on some of this [1] - and truth be told, I'm unable to stabilize those
deviations on this __modern__ client CPU. So I tried another, much more
predictable (and non-modern :P) client CPU: an Intel Core i5 7600K (1s4c4t),
and got much more consistent numbers there (unfiltered, from an almost
identical setup: also 6.14.x, same tweaks, also taskset to 1 core):

Master:
Time: 2592.351 ms (00:02.592)
Time: 2574.612 ms (00:02.575)
Time: 2592.530 ms (00:02.593)
Time: 2575.356 ms (00:02.575)
Time: 2594.687 ms (00:02.595)
Avg=2585ms

Master+patch:
Time: 2577.610 ms (00:02.578)
Time: 2585.796 ms (00:02.586)
Time: 2568.559 ms (00:02.569)
Time: 2586.199 ms (00:02.586)
Time: 2567.872 ms (00:02.568)
Avg=2576ms (below master?!)

Master+patch__builtin_clzl:
Time: 2578.083 ms (00:02.578)
Time: 2586.732 ms (00:02.587)
Time: 2573.176 ms (00:02.573)
Time: 2592.048 ms (00:02.592)
Time: 2575.731 ms (00:02.576)
Time: 2575.570 ms (00:02.576)
Avg=2579ms (below master?!)

Just Master again:
Time: 2578.838 ms (00:02.579)
Time: 2588.531 ms (00:02.589)
Time: 2572.165 ms (00:02.572)
Time: 2591.528 ms (00:02.592)
Time: 2572.015 ms (00:02.572)
Time: 2589.921 ms (00:02.590)
Time: 2572.124 ms (00:02.572)
Avg=2580ms

So to sum up:
- it still looks OK to me
- thermals on modern-day CPUs can have a bigger impact than the patch itself (?)
- an older/legacy (desktop) CPU seems to jitter less than a modern client
  laptop CPU, even with the strictest perf settings (?)
- worst case: to spot that ~2% regression one would have to disable the IO
  batching and enable track_io_timing (neither of which is the default)

I'm attaching v3, which now defaults to __builtin_clzl(); that works OK for
uint64 here (not sure whether I need to care about __builtin_clzll on
Windows?).
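
To make the clz part of this a bit more concrete, here is a minimal
illustrative sketch (not literally the v3 code; io_time_to_bucket() and
PG_STAT_IO_HIST_BUCKETS are made-up names, and the log2 bucket layout is just
an assumption): pg_leftmost_one_pos64() from port/pg_bitutils.h already picks
a suitable clz implementation per platform, so using it would sidestep the
__builtin_clzl() vs __builtin_clzll() portability question:

#include "postgres.h"
#include "port/pg_bitutils.h"

#define PG_STAT_IO_HIST_BUCKETS 16		/* assumed bucket count */

static inline int
io_time_to_bucket(uint64 elapsed_us)
{
	int			bucket;

	/* 0us and 1us both land in the first bucket; clz(0) is undefined */
	if (elapsed_us < 2)
		return 0;

	/* floor(log2(elapsed_us)): 2-3us -> 1, 4-7us -> 2, 8-15us -> 3, ... */
	bucket = pg_leftmost_one_pos64(elapsed_us);

	/* everything past the last boundary goes into the overflow bucket */
	return Min(bucket, PG_STAT_IO_HIST_BUCKETS - 1);
}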

Open questions:
0. Should I pursue more benchmarking, or are the above results enough?
1. __builtin_clzl() or not to __builtin_clzl(), that is the question... ?
2. Should I add per-PID backend stats too, or do you have something against it?
3. Shouldn't we fix the mdsyncfiletag() issue mentioned earlier - we seem to
call pgstat_count_io_op_time() *after* a potential FileClose() (as per my
earlier question)?

-J.

[1] - https://www.postgresql.org/message-id/20231115180433.p3eeaczbam5zxdz5%40awork3.anarazel.de
[2] - https://vondra.me/posts/benchmarking-is-hard-sometimes/

Attachment Content-Type Size
v3-0001-Add-pg_stat_io_histogram-view-to-provide-more-det.patch text/x-patch 32.9 KB
