Re: Show WAL write and fsync stats in pg_stat_io

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "bharath(dot)rupireddyforpostgres(at)gmail(dot)com" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Show WAL write and fsync stats in pg_stat_io
Date: 2024-01-11 05:00:54
Message-ID: ZZ92BliJve8qoe7j@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> I have code review feedback as well, but I've saved that for my next email.

Ah, cool.

> On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
>> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> Oh, I understand it now. Yes, that makes sense.
>> I thought removing op_bytes completely ( as you said "This patch
>> extends it with two more operation sizes, and there are even cases
>> where it may be a variable" ) from pg_stat_io view then adding
>> something like {read | write | extend}_bytes and {read | write |
>> extend}_calls could be better, so that we don't lose any information.
>
> Upthread, Michael says:
>
>> I find the use of 1 in this context a bit confusing, because when
>> referring to a counter at N, then it can be understood as doing N
>> times a operation,
>
> I didn't understand this argument, so I'm not sure if I agree or
> disagree with it.

Nazir has mentioned upthread one thing: what should we do for the case
where a combination of (io_object,io_context) does I/O with a
*variable* op_bytes, because that may be the case for the WAL
receiver? For this case, he has mentioned that we should set op_bytes
to 1, but that's something I find confusing because it would mean that
we are doing read, writes or extends 1 byte at a time. My suggestion
would be to use op_bytes = -1 or NULL for the variable case instead,
with reads, writes and extends referring to a number of bytes rather
than a number of operations.

> I think these are the three proposals for handling WAL reads:
>
> 1) setting op_bytes to 1 and the number of reads is the number of bytes
> 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> number of calls to pg_pread() or similar
> 3) setting op_bytes to NULL and the number of reads is the number of
> calls to pg_pread() or similar

3) could be a number of bytes, actually.

> Looking at the patch, I think it is still doing 2.

The patch disables stats for the WAL receiver, while the startup
process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
discard the variable case.

> For an unpopular idea: we could add separate [IOOp]_bytes columns for
> all those IOOps for which it would be relevant. It kind of stinks but
> it would give us the freedom to document exactly what a single IOOp
> means for each combination of BackendType, IOContext, IOObject, and
> IOOp (as relevant) and still have an accurate number in the *bytes
> columns. Everyone will probably hate us if we do that, though.
> Especially because having bytes for the existing IOObjects is an
> existing feature.

An issue I have with this one is that having multiple tuples for
each (object,context) if they have multiple op_bytes leads to
potentially a lot of bloat in the view. That would be up to 8k extra
tuples just for the sake of op_byte's variability.

> A separate question: suppose [1] goes in (to read WAL from WAL buffers
> directly). Now, WAL reads are not from permanent storage anymore. Are
> we only tracking permanent storage I/O in pg_stat_io? I also had this
> question for some of the WAL receiver functions. Should we track any
> I/O other than permanent storage I/O? Or did I miss this being
> addressed upthread?

That's a good point. I guess that this should just be a different
IOOp? That's not a IOOP_READ. A IOOP_HIT is also different.

> In terms of what I/O we should track in a streaming/asynchronous
> world, the options would be:
>
> 1) track read/write syscalls
> 2) track blocks of BLCKSZ submitted to the kernel
> 3) track bytes submitted to the kernel
> 4) track merged I/Os (after doing any merging in the application)
>
> I think the debate was largely between 2 and 4. There was some
> disagreement, but I think we landed on 2 because there is merging that
> can happen at many levels in the storage stack (even the storage
> controller). Distinguishing between whether or not Postgres submitted
> 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> but I think it might be confusing for the Postgres user trying to
> determine why their query is slow. It probably makes the most sense to
> still track in block size.
>
> No matter what solution we pick, you should get a correct number if
> you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> rather, there should be some way of getting an accurate number in
> bytes of the amount of a particular kind of I/O that has been done.

Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
writes or extends are counted as bytes, because we don't have a fixed
operation size for some (object,context) cases.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2024-01-11 05:12:00 Re: introduce dynamic shared memory registry
Previous Message vignesh C 2024-01-11 04:20:19 Re: Documentation to upgrade logical replication cluster