Re: pg_stat_io not tracking smgrwriteback() is confusing

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
Subject: Re: pg_stat_io not tracking smgrwriteback() is confusing
Date: 2023-04-24 21:14:32
Message-ID: 20230424211432.u2z6v6mceemznd6t@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-04-24 16:39:36 -0400, Melanie Plageman wrote:
> On Wed, Apr 19, 2023 at 10:23:26AM -0700, Andres Freund wrote:
> > Hi,
> >
> > I noticed that the numbers in pg_stat_io don't quite add up to what I
> > expected in write heavy workloads. Particularly for checkpointer, the numbers
> > for "write" in log_checkpoints output are larger than what is visible in
> > pg_stat_io.
> >
> > That is partly because log_checkpoints' "write" covers way too many things,
> > but there's an issue with pg_stat_io as well:
> >
> > Checkpoints, and some other sources of writes, will often end up doing a lot
> > of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
> > pre-existing forms of IO statistics.
> >
> > It seems pretty clear that we should track writeback as well. I wonder if it's
> > worth doing so for 16? It'd give a more complete picture that way. The
> > counter-argument I see is that we didn't track the time for it in existing
> > stats either, and that nobody complained - but I suspect that's mostly because
> > nobody knew to look.
>
> Not complaining about making pg_stat_io more accurate, but what exactly
> would we be tracking for smgrwriteback()? I assume you are talking about
> IO timing. AFAICT, on Linux, it does sync_file_range() with
> SYNC_FILE_RANGE_WRITE, which is asynchronous. Wouldn't we just be
> tracking the system call overhead time?

It starts blocking once "enough" IO is in flight. For things like an immediate
checkpoint, that can happen fairly quickly, unless you have a very fast IO
subsystem. So often it won't matter whether we track smgrwriteback(), but
when it matters, it can matter a lot.
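
For illustration, here's a minimal standalone program showing what the Linux
path underneath smgrwriteback() boils down to (it ends up in pg_flush_data(),
which issues sync_file_range() with SYNC_FILE_RANGE_WRITE); the file name and
range here are made up, this is just a sketch:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char	buf[8192];
	int		fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

	memset(buf, 'x', sizeof(buf));
	if (fd < 0 || write(fd, buf, sizeof(buf)) != sizeof(buf))
	{
		perror("setup");
		return 1;
	}

	/*
	 * Initiate writeback of the dirty range. Without the
	 * SYNC_FILE_RANGE_WAIT_* flags this doesn't wait for the IO to
	 * complete - but the syscall itself stalls once enough IO is in
	 * flight, which is the time that currently goes unaccounted.
	 */
	if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) < 0)
		perror("sync_file_range");

	close(fd);
	return 0;
}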

As an example, I initialized a pgbench database with scale 1000, on a decent
but not great NVMe SSD, and created dirty data with:

c=96;/srv/dev/build/m-opt/src/bin/pgbench/pgbench --random-seed=0 -n -M prepared -c$c -j$c -T30 -P1
and then measured the checkpoint:
perf trace -s -p $pid_of_checkpointer psql -x -c "SELECT pg_stat_reset_shared('io')" -c "checkpoint"

postgres (367444), 1891280 events, 100.0%

   syscall            calls  errors     total       min       avg       max  stddev
                                       (msec)    (msec)    (msec)    (msec)     (%)
   ---------------  -------  ------  --------  --------  --------  --------  ------
   sync_file_range   359176       0  4560.670     0.002     0.013   238.955  10.36%
   pwrite64          582964       0  2874.634     0.003     0.005     0.156   0.06%
   fsync                242       0   251.631     0.001     1.040    42.166  18.81%
   openat               317      65     2.171     0.002     0.007     0.068   5.69%
   rename                69       0     1.973     0.012     0.029     0.084   5.81%
   fdatasync              1       0     1.543     1.543     1.543     1.543   0.00%
   unlink               150     137     1.278     0.002     0.009     0.062  10.69%
   close                250       0     0.694     0.001     0.003     0.007   3.14%
   newfstatat           140      68     0.667     0.001     0.005     0.022   7.26%
   write                  5       0     0.067     0.005     0.013     0.025  24.55%
   lseek                 14       0     0.050     0.001     0.004     0.018  33.87%
   getdents64             8       0     0.047     0.002     0.006     0.022  39.51%
   kill                   3       0     0.029     0.008     0.010     0.011  10.18%
   epoll_wait             2       1     0.006     0.000     0.003     0.006 100.00%
   read                   1       0     0.004     0.004     0.004     0.004   0.00%

Log output:

2023-04-24 14:11:59.234 PDT [367444][checkpointer][:0][] LOG: checkpoint starting: immediate force wait
2023-04-24 14:12:09.236 PDT [367444][checkpointer][:0][] LOG: checkpoint complete: wrote 595974 buffers (28.4%); 0 WAL file(s) added, 0 removed, 68 recycled; write=9.740 s, sync=0.057 s, total=10.002 s; sync files=27, longest=0.043 s, average=0.003 s; distance=4467386 kB, estimate=4467386 kB; lsn=6/E5D33F98, redo lsn=6/E5D33F28

# SELECT writes, write_time, fsyncs, fsync_time FROM pg_stat_io WHERE backend_type = 'checkpointer';
┌────────┬────────────────────┬────────┬────────────┐
│ writes │ write_time │ fsyncs │ fsync_time │
├────────┼────────────────────┼────────┼────────────┤
│ 595914 │ 4002.1730000000002 │ 24 │ 46.359 │
└────────┴────────────────────┴────────┴────────────┘
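
To make the gap concrete: pg_stat_io reports ~4.0 s of write_time, while perf
attributes another ~4.5 s to sync_file_range that no statistic accounts for.
What I have in mind is roughly the following - a sketch only, reusing the
timing helpers pg_stat_io already has; IOOP_WRITEBACK doesn't exist yet and
the wrapper function name is made up:

#include "postgres.h"

#include "pgstat.h"
#include "storage/smgr.h"

/*
 * Count and time a writeback request the same way pg_stat_io already
 * handles writes, so the sync_file_range() stalls shown above become
 * visible in the stats.
 */
static void
CountedWriteback(SMgrRelation reln, ForkNumber forknum,
				 BlockNumber blocknum, BlockNumber nblocks)
{
	instr_time	io_start = pgstat_prepare_io_time();

	smgrwriteback(reln, forknum, blocknum, nblocks);

	/* accumulates both the number of calls and the elapsed time */
	pgstat_count_io_op_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL,
							IOOP_WRITEBACK, io_start, 1);
}

Exposed as writeback / writeback_time columns, that would make the ~4.5 s
visible in the view above instead of only in perf output.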

Greetings,

Andres Freund
