Re: Show WAL write and fsync stats in pg_stat_io

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "bharath(dot)rupireddyforpostgres(at)gmail(dot)com" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: Show WAL write and fsync stats in pg_stat_io
Date: 2023-08-03 13:38:41
Message-ID: CAN55FZ05OQ-FVD-OCbyb8RFDFViNabiL4aAbgbZFsryyEN38Cg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for the review!

Current status of the patch is:
- 'WAL read' stats in xlogrecovery.c are added to pg_stat_io.
- IOCONTEXT_INIT is added to count 'WAL init'. 'WAL init' stats are
added to pg_stat_io.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned
in pg_stat_io.
- Working on adding 'WAL read' to the xlogreader.c and walsender.c.
- PendingWalStats.wal_sync and
PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time are moved
to pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Thinking about how to interpret the different IOOps within the two
IOContexts and discussing what would be useful to count.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Adding 'WAL read' to the xlogreader.c and walsender.c. (This could
be an another patch)
- Adding WAIT_EVENT_WAL_COPY_* operations to pg_stat_io if needed.
(This could be an another patch)

On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> I think it would be good to count WAL reads even though they are not
> currently represented in pg_stat_wal. Here is a thread discussing this
> [1].

I used the same implementation in the thread link [1]. I added 'WAL
read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
xlogreader.c and walsender.c because they cause some failures on:
'!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
these. Also, I added Bharath to CC. I have a question about 'WAL
read':
1. There are two places where 'WAL read' happens.
a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
time count is equal to XLOG_BLCKSZ but there are some cases it is not.
For example
- in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
- in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
These nbytes and count variables could be different from XLOG_BLCKSZ.

b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);

So, what should op_bytes be set to for 'WAL read' operations?

> Eventually, the docs will need an update as well. You can wait until a
> later version of the patch to do this, but I would include it in a list
> of the remaining TODOs in your next version.

Done. I shared TODOs at the top.

> I think we will also want to add an IOContext for WAL initialization.
> Then we can track how long is spent doing 'WAL init' (including filling
> the WAL file with zeroes). XLogFileInitInternal() is likely where we
> would want to add it. And op_bytes for this would likely be
> wal_segment_size. I thought I heard about someone proposing adding WAL
> init to pg_stat_wal, but I can't find the thread.

Done. I created a new IOCONTEXT_INIT IOContext for the 'WAL init'. I
have a question there:
1. Some of the WAL processes happens at initdb (standalone backend
IOCONTEXT_NORMAL/(IOOP_READ & IOOP_WRITE) and
IOCONTEXT_INIT/(IOOP_WRITE & IOOP_FSYNC)). Since this happens at the
initdb, AFAIK there is no way to set 'track_wal_io_timing' and
'track_io_timing' variables there. So, their timings appear as 0.
Should I use IsBootstrapProcessingMode() to enable WAL io timings at
the initdb or are they not that much important?

> I think there is also an argument for counting WAL files recycled as
> IOOP_REUSES. We should start thinking about how to interpret the
> different IOOps within the two IOContexts and discussing what would be
> useful to count. For example, should removing a logfile count as an
> IOOP_EVICT? Maybe it is not directly related to "IO" enough or even an
> interesting statistic, but we should think about what kinds of
> IO-related WAL statistics we want to track.

I added that to TODOs.

> Any that we decide not to count for now should be "banned" in
> pgstat_tracks_io_op() for clarity. For example, if we create a separate
> IOContext for WAL file init, I'm not sure what would count as an
> IOOP_EXTEND in IOCONTEXT_NORMAL for IOOBJECT_WAL.
>
> Also, I think there are some backend types which will not generate WAL
> and we should determine which those are and skip those rows in
> pgstat_tracks_io_object().

I agree, I am working on this. I have a couple of questions:
1. Can client backend and background worker do IOCONTEXT_NORMAL/IOOP_READ?
2. Is there an easy way to check if 'BackendType / IOOBJECT_WAL' does
specific IOOp operations?

> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index 8b0710abe6..2ee6c21398 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -2207,6 +2207,10 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID
> tli, bool flexible)
>
> I think we should likely follow the pattern of using
> pgstat_prepare_io_time() and pgstat_count_io_op_time() as it is done
> elsewhere. You could pass the IOObject as a parameter to
> pgstat_prepare_io_time() in order to determine if we should check
> track_io_timing or track_wal_io_timing. And we'll want to check
> track_wal_io_timing if IOObject is IOOBJECT_WAL in
> pgstat_count_io_op_time().

Done. Instead of passing parameters to pgstat_prepare_io_time(), I
used a slightly different implementation. I return the current time if
there is a chance that any 'time' can be tracked.

> INSTR_TIME_SET_CURRENT(duration);
>
> INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration,
> start);
>
> + pgstat_count_io_op_time(IOOBJECT_WAL,
> IOCONTEXT_NORMAL, IOOP_WRITE, start, 1);
> + } else
> + {
>
> Other users of pgstat_count_io_op_time()/io_op_n() which write multiple
> pages at a time pass the number of pages in as the cnt parameter. (see
> ExtendBufferedRelLocal() as an example). I think we want to do that for
> WAL also. In this case, it would be the local variable "npages" and we
> can do it outside of this loop.
>
> It is true that the existing WAL stats count wal_writes here. However,
> this is essentially counting write system calls, which is probably not
> what we want for pg_stat_io. See [2] for a discussion about whether to
> count blocks written back or writeback system calls for a previous
> pg_stat_io feature. All of the other block-based IO statistics in
> pg_stat_io count the number of blocks.
>
> This being said, we probably want to just leave
> PendingWalStats.wal_write++ here. We would normally move it into
> pg_stat_io like we have with pgBufferUsage and the db IO stats that are
> updated in pgstat_count_io_op_time(). This consolidation makes it easier
> to eventually reduce the duplication. However, in this case, it seems
> wal_write counts something we don't count in pg_stat_io, so it can
> probably be left here. I would still move the
> PendingWalStats.wal_write_time into pgstat_count_io_op_time(), since
> that seems like it is the same as what will be in pg_stat_io.

Done. I moved PendingWalStats.wal_sync and
PendingWalStats.wal_write_time/PendingWalStats.wal_sync_time to
pgstat_count_io_op_n()/pgstat_count_io_op_time() respectively. Because
of this change, pg_stat_wal's and pg_stat_io's
IOOBJECT_WAL/IOCONTEXT_NORMAL/IOOP_WRITE counts are different but the
rest are the same.

> Also, op_bytes for IOOBJECT_WAL/IOCONTEXT_NORMAL should be XLOG_BLCKSZ
> (see comment in pg_stat_get_io() in pgstatfuncs.c). Those default to the
> same value but can be made to be different.

Done.

> I would wrap this line and check other lines to make sure they are not
> too long.

Done.

>
> + } else
> + {
> + pgstat_count_io_op_n(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC, 1);
> }
>
> PendingWalStats.wal_sync++;
>
> Same feedback as above about using the prepare/count pattern used for
> pg_stat_io elsewhere. In this case, you should be able to move
> PendingWalStats.wal_sync into there as well.

Done.

> > There is a track_io_timing variable to control pg_stat_io
> > timings and a track_wal_io_timing variable to control WAL timings. I
> > couldn't decide on which logic to enable WAL timings on pg_stat_io.
> > For now, both pg_stat_io and track_wal_io_timing are needed to be
> > enabled to track WAL timings in pg_stat_io.
>
> Hmm. I could see a case where someone doesn't want to incur the
> overhead of track_io_timing for regular IO but does want to do so for
> WAL because they are interested in a specific issue. I'm not sure
> though. I could be convinced otherwise (based on relative overhead,
> etc).

Done. IOOBJECT_WAL uses track_wal_io_timing regardless of
track_io_timing for now.

> [1] https://www.postgresql.org/message-id/flat/20230216191138.jotc73lqb7xhfqbi%40awork3.anarazel.de#eb4a641427fa1eb013e9ecdd8648e640
> [2] https://www.postgresql.org/message-id/20230504165738.4e2hfoddoels542c%40awork3.anarazel.de

In addition to these, are WAIT_EVENT_WAL_COPY_* operations needed to
be added to pg_stat_io? If the answer is yes, should I add them to the
current patch?

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft

Attachment Content-Type Size
v2-0001-Show-WAL-stats-on-pg_stat_io.patch text/x-diff 14.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andy Fan 2023-08-03 13:50:15 Re: Extract numeric filed in JSONB more effectively
Previous Message Ashutosh Bapat 2023-08-03 13:29:07 Re: [PoC] Reducing planning time when tables have many partitions