Re: Show WAL write and fsync stats in pg_stat_io

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "bharath(dot)rupireddyforpostgres(at)gmail(dot)com" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: Show WAL write and fsync stats in pg_stat_io
Date: 2023-09-20 07:57:48
Message-ID: CAN55FZ13ZViv+4V34TqQQ0aEzKbnPfDbYnBr=iJLPNNCPEuHfg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for the review!

Current status of the patch is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
- IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned in
pg_stat_io.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time /
PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() /
pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Try to set op_bytes for BackendType / IOContext.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
- Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

I am adding tracking of BackendType / IOContext / IOOp as tables, empty
cell means it is not decided yet:

IOCONTEXT_NORMAL / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╦═══════╗
║ IOCONTEXT_NORMAL ║ read ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum launcher ║ FALSE ║ FALSE ║ FALSE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum worker ║ FALSE ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ client backend ║ ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background worker ║ ║ ║ ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background writer ║ ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ checkpointer ║ ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ standalone backend ║ TRUE ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ startup ║ TRUE ║ ║ ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walreceiver ║ ║ ║ ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walsender ║ ║ ║ ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walwriter ║ ║ TRUE ║ TRUE ║
╚═════════════════════╩═══════╩═══════╩═══════╝

IOCONTEXT_WAL_INIT / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╗
║ IOCONTEXT_WAL_INIT ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum launcher ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum worker ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ client backend ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ background worker ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ background writer ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ checkpointer ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ standalone backend ║ TRUE ║ TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ startup ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ walreceiver ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ walsender ║ ║ ║
╠═════════════════════╬═══════╬═══════╣
║ walwriter ║ ║ ║
╚═════════════════════╩═══════╩═══════╝

On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman(at)gmail(dot)com>
wrote:
>
> > On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
> > <melanieplageman(at)gmail(dot)com> wrote:
> > > I think it would be good to count WAL reads even though they are not
> > > currently represented in pg_stat_wal. Here is a thread discussing this
> > > [1].
> >
> > I used the same implementation in the thread link [1]. I added 'WAL
> > read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> > xlogreader.c and walsender.c because they cause some failures on:
> > '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> > these. Also, I added Bharath to CC. I have a question about 'WAL
> > read':
> > 1. There are two places where 'WAL read' happens.
> > a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
> > time count is equal to XLOG_BLCKSZ but there are some cases it is not.
> > For example
> > - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
> > - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
> > These nbytes and count variables could be different from XLOG_BLCKSZ.
> >
> > b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ
bytes:
> > pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> >
> > So, what should op_bytes be set to for 'WAL read' operations?
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

I agree. I will edit that later, added to TODOs.

>
> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
> treat op_bytes * number of reads as an approximation of the number of
> bytes read. I don't actually know what makes more sense. I don't think I
> would like having a number for bytes that is not accurate.

Yes, the prior one makes more sense to me.

>
> > Should I use IsBootstrapProcessingMode() to enable WAL io timings at
> > the initdb or are they not that much important?
>
> I don't have an opinion about this. I can see an argument for doing it
> either way. We do track other IO during initdb in pg_stat_io.

I didn't add it for now. It is an easy change, it could be added later.

>
> As an additional TODO, I would explore adding some tests to prevent
> accidental removal of the pg_stat_io WAL tracking.
>
> I think we can easily test IOCONTEXT_NORMAL WAL writes in
> src/test/regress/sql/stats.sql (perhaps it is worth checking that
> synchronous_commit is on in the test). IOCONTEXT_NORMAL WAL fsyncs
> should again be easy to test if synchronous_commit is on and fsync is
> on.
>
> I'm not sure how to reliably test WAL reads (given timing). Logically,
> you can sum WAL reads before a crash is initiated in one of the tests in
> the recovery suite, and then sum them after the db has restarted and
> there should definitely be an increase in WAL reads, but I don't know if
> we need to do something to guarantee that there will have been WAL reads
> (to avoid test flakes).
>
> I'm also not sure how to reliably test any IOCONTEXT_INIT operations. We
> need a before and after and I can't think of a cheap operation to ensure
> a new WAL segment is written to or fsyncd in between a before and after
> for the purposes of testing.

IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
For the IOCONTEXT_NORMAL reads and IOCONTEXT_INIT tests, I couldn't find a
way to avoid test flakes. I am open to suggestions. I added these to TODOs.

>
> > ---
> > diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
> > index 60c0b7ec3af..ee7b85e18ca 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli,
bool flexible)
> > startoffset += written;
> > } while (nleft > 0);
> >
>
> I'm not sure if the right location is here or in
> pgstat_count_io_op_time(), but I would explain why you did not move
> PendingWalStats.wal_writes counter into pg_stat_io code (and why you did
> move the other PendingWalStats counters there.
>
> > + pgstat_count_io_op_time(IOOBJECT_WAL,
IOCONTEXT_NORMAL,
> > +
IOOP_WRITE, io_start, npages);
> > +
> > npages = 0;
> >
> > /*
> > @@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno,
TimeLineID logtli,
> > int fd;
> > int save_errno;
> > int open_flags = O_RDWR | O_CREAT | O_EXCL |
PG_BINARY;
> > + instr_time io_start;
> >
> > Assert(logtli != 0);
> >
> > @@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno,
TimeLineID logtli,
> > (errcode_for_file_access(),
> > errmsg("could not create file \"%s\":
%m", tmppath)));
> >
>
> Since you have two calls to pgstat_prepare_io_time() in this function, I
> think it would be nice to have a comment above each to the effect of
> "start timing writes for stats" and "start timing fsyncs for stats"

Done.

>
> > + io_start = pgstat_prepare_io_time();
> > +
> > pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
>
> > diff --git a/src/backend/access/transam/xlogrecovery.c
b/src/backend/access/transam/xlogrecovery.c
> > index becc2bda62e..ee850af5514 100644
> > --- a/src/backend/access/transam/xlogrecovery.c
> > +++ b/src/backend/access/transam/xlogrecovery.c
> > @@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
> > XLogRecord *record;
> > bool reachedRecoveryTarget = false;
> > TimeLineID replayTLI;
> > + uint32 pgstat_report_wal_frequency = 0;
> >
> > /*
> > * Initialize shared variables for tracking progress of WAL
replay, as if
> > @@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
> > */
> > ApplyWalRecord(xlogreader, record, &replayTLI);
> >
> > + /*
> > + * Report pending statistics to the cumulative
stats system once
> > + * every PGSTAT_REPORT_FREQUENCY times to not
hinder performance.
> > + */
> > + if (pgstat_report_wal_frequency++ ==
PGSTAT_REPORT_FREQUENCY)
> > + {
> > + pgstat_report_wal(false);
> > + pgstat_report_wal_frequency = 0;
> > + }
> > +
>
> Is the above needed for your patch to work? What does it do? It should
> probably be in a separate commit and should definitely have an
> explanation.

Done, I omit that part.

>
> > --- a/src/backend/utils/activity/pgstat_io.c
> > +++ b/src/backend/utils/activity/pgstat_io.c
> > @@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext
io_context, IOOp io_op, uint3
> > Assert((unsigned int) io_op < IOOP_NUM_TYPES);
> > Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context,
io_op));
>
> I would add a comment here explaining that pg_stat_wal doesn't count WAL
> init or WAL reads.

Done.

>
> > + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> > + io_op == IOOP_FSYNC)
> > + PendingWalStats.wal_sync += cnt;
> > +
> > PendingIOStats.counts[io_object][io_context][io_op] += cnt;
> >
> > have_iostats = true;
> > }
>
> > +/*
> > + * Prepares io_time for pgstat_count_io_op_time() function. It needs
to return
> > + * current time if there is a chance that any 'time' can be tracked.
> > + */
> > instr_time
> > pgstat_prepare_io_time(void)
> > {
> > instr_time io_start;
> >
> > - if (track_io_timing)
> > + if(track_io_timing || track_wal_io_timing)
> > INSTR_TIME_SET_CURRENT(io_start);
> > else
> > INSTR_TIME_SET_ZERO(io_start);
>
> Since you asked me off-list why we had to do INSTR_TIME_SET_ZERO() and I
> couldn't remember, it is probably worth a comment.

Done.

>
> > pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp
io_op,
> > instr_time start_time,
uint32 cnt)
> > {
> > - if (track_io_timing)
> > + if (pgstat_should_track_io_time(io_object, io_context))
> > {
> > instr_time io_time;
> >
> > @@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object,
IOContext io_context, IOOp io_op,
> >
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
>
> Now that we are adding more if statements to this function, I think we
> should start adding more comments.
>
> We should explain what the different counters here are for e.g.
> pgBufferUsage for EXPLAIN, PendingWalStats for pg_stat_wal.
>
> We should also explain what is tracked for each and why it differs --
> e.g. some track time and some don't, some track only reads or writes,
> etc.
>
> Also we should mention why we are consolidating them here. That is, we
> want to eventually deduplicate these counters, so we are consolidating
> them first. This also makes it easy to compare what is tracked for which
> stats or instrumentation purpose.
>
> And for those IO counters that we haven't moved here, we should mention
> it is because they track at a different level of granularity or at a
> different point in the call stack.

Done.

>
> > if (io_object == IOOBJECT_RELATION)
> >
INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> > + /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on
PendingWalStats */
> > + else if (io_object == IOOBJECT_WAL && io_context
== IOCONTEXT_NORMAL)
> > +
INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
> > }
>
>
> Also, I would reorder the if statements to be in order of the enum
> values (e.g. FSYNC, READ, WRITE).

Done.

>
> > else if (io_op == IOOP_READ)
> > {
> > @@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object,
IOContext io_context, IOOp io_op,
> > if (io_object == IOOBJECT_RELATION)
> >
INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
> > }
> > + else if (io_op == IOOP_FSYNC)
> > + {
> > + /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on
PendingWalStats */
>
> I wouldn't squeeze this comment here like this. It is hard to read

Done.

>
> > + if (io_object == IOOBJECT_WAL && io_context ==
IOCONTEXT_NORMAL)
> > +
INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);
>
>
> > + * op_bytes can change according to IOObject and IOContext.
> > + * Return BLCKSZ as default.
> > + */
> > +int
> > +pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context)
> > +{
>
> Small typo in function name:
> pgstat_get_io_op_btyes -> pgstat_get_io_op_bytes
> I'd also mention why BLCKSZ is the default

Done.

>
> > + if (io_object == IOOBJECT_WAL)
> > + {
> > + if (io_context == IOCONTEXT_NORMAL)
> > + return XLOG_BLCKSZ;
> > + else if (io_context == IOCONTEXT_INIT)
> > + return wal_segment_size;
> > + }
> > +
> > + return BLCKSZ;
> > +}
>
> > @@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype,
IOObject io_object,
> > if (!pgstat_tracks_io_bktype(bktype))
> > return false;
> >
> > + /*
> > + * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
> > + * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
> > + */
> > + if (io_object == IOOBJECT_WAL &&
> > + (io_context != IOCONTEXT_NORMAL &&
>
> Little bit of errant whitespace here.

Done.

>
> > /*
> > * Currently, IO on temporary relations can only occur in the
> > * IOCONTEXT_NORMAL IOContext.
> > @@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject
io_object,
> > if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
> > return false;
>
> I would expand on the comment to explain what NORMAL is for WAL -- what
> we consider normal to be and why. And why it is different than INIT.

Done.

>
> >
> > + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
> > + !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
> > + return false;
> > +
> > + if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> > + !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op ==
IOOP_FSYNC))
> > + return false;
>
> These are the first "bans" that we have for an IOOp for a specific
> combination of io_context and io_object. We should add a new comment for
> this and perhaps consider what ordering makes most sense. I tried to
> organize the bans from most broad to most specific at the bottom.

Done.

>
> >
> > --- a/src/backend/utils/adt/pgstatfuncs.c
> > +++ b/src/backend/utils/adt/pgstatfuncs.c
> > @@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
> > * and constant multipliers, once
non-block-oriented IO (e.g.
> > * temporary file IO) is tracked.
> > */
> > - values[IO_COL_CONVERSION] =
Int64GetDatum(BLCKSZ);
>
> There's a comment above this in the code that says this is hard-coded to
> BLCKSZ. That comment needs to be updated or removed (in lieu of the
> comment in your pgstat_get_io_op_bytes() function).

Done.

>
>
> > + op_bytes = pgstat_get_io_op_btyes(io_obj,
io_context);
> > + values[IO_COL_CONVERSION] =
Int64GetDatum(op_bytes);
> >
>
> > +extern PGDLLIMPORT bool track_wal_io_timing;
> > +extern PGDLLIMPORT int wal_segment_size;
>
> These shouldn't be in two places (i.e. they are already in xlog.h and
> you added them in pgstat.h. pg_stat_io.c includes bufmgr.h for
> track_io_timing, so you can probably justify including xlog.h.

Done.

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachment Content-Type Size
v3-0002-Add-IOOBJECT_WAL-IOCONTEXT_NORMAL-write-and-fsync.patch text/x-diff 3.3 KB
v3-0001-Show-WAL-stats-except-streaming-replication-WAL-o.patch text/x-diff 16.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2023-09-20 08:27:22 Re: pg_rewind with cascade standby doesn't work well
Previous Message Andrey Lepikhov 2023-09-20 07:56:53 Re: disfavoring unparameterized nested loops