Re: Show WAL write and fsync stats in pg_stat_io

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "bharath(dot)rupireddyforpostgres(at)gmail(dot)com" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Show WAL write and fsync stats in pg_stat_io
Date: 2024-01-12 13:23:26
Message-ID: CAN55FZ1=ieDk2DdF7NY8W61AaozBBf1Eg-ECZBU3ravX_o==FA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Thu, 11 Jan 2024 at 17:28, Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Thu, Jan 11, 2024 at 6:19 AM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
> >
> > On Thu, 11 Jan 2024 at 08:01, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > >
> > > On Wed, Jan 10, 2024 at 07:24:50PM -0500, Melanie Plageman wrote:
> > > > On Wed, Jan 3, 2024 at 8:11 AM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:
> > > >> On Sun, 31 Dec 2023 at 03:58, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > > >> Oh, I understand it now. Yes, that makes sense.
> > > >> I thought removing op_bytes completely ( as you said "This patch
> > > >> extends it with two more operation sizes, and there are even cases
> > > >> where it may be a variable" ) from pg_stat_io view then adding
> > > >> something like {read | write | extend}_bytes and {read | write |
> > > >> extend}_calls could be better, so that we don't lose any information.
> > > >
> > > > Upthread, Michael says:
> > > >
> > > >> I find the use of 1 in this context a bit confusing, because when
> > > >> referring to a counter at N, then it can be understood as doing N
> > > >> times a operation,
> > > >
> > > > I didn't understand this argument, so I'm not sure if I agree or
> > > > disagree with it.
> > >
> > > Nazir has mentioned upthread one thing: what should we do for the case
> > > where a combination of (io_object,io_context) does I/O with a
> > > *variable* op_bytes, because that may be the case for the WAL
> > > receiver? For this case, he has mentioned that we should set op_bytes
> > > to 1, but that's something I find confusing because it would mean that
> > > we are doing read, writes or extends 1 byte at a time. My suggestion
> > > would be to use op_bytes = -1 or NULL for the variable case instead,
> > > with reads, writes and extends referring to a number of bytes rather
> > > than a number of operations.
> >
> > I agree but we can't do this only for the *variable* cases since
> > B_WAL_RECEIVER and other backends use the same
> > pgstat_count_io_op_time(IOObject, IOContext, ...) call. What I mean
> > is, if two backends use the same pgstat_count_io_op_time() function
> > call in the code; they must count the same thing (number of calls,
> > bytes, etc.). It could be better to count the number of bytes for all
> > the IOOBJECT_WAL IOs.
>
> I'm a bit confused by this. pgstat_count_io_op_time() can check
> MyBackendType. In fact, you do this to ban the wal receiver already.
> It is true that you would need to count all wal receiver normal
> context wal object IOOps in the variable way, but I don't see how
> pgstat_count_io_op_time() is the limiting factor as long as the
> callsite is always doing either the number of bytes or the number of
> calls.

Apologies for not being clear. Let me try to explain this by giving examples:

Let's assume that there are 3 different pgstat_count_io_op_time()
calls in the code base and they are labeled as 1, 2 and 3.

And let's' assume that: WAL receiver uses 1st and 2nd
pgstat_count_io_op_time(), autovacuum uses 2nd and 3rd
pgstat_count_io_op_time() and checkpointer uses 3rd
pgstat_count_io_op_time() to count IOs.

The 1st one is the only pgstat_count_io_op_time() call that must count
the number of bytes because of the variable cases and the others count
the number of calls or blocks.

a) WAL receiver uses both 1st and 2nd => 1st and 2nd
pgstat_count_io_op_time() must count the same thing => 2nd
pgstat_count_io_op_time() must count the number of bytes as well.

b) 2nd pgstat_count_io_op_time() started to count the number of bytes
=> Autovacuum will start to count the number of bytes => 2nd and 3rd
both are used by autocavuum => 3rd pgstat_count_io_op_time() must
count the number of bytes as well.

c) 3rd pgstat_count_io_op_time() started to count the number of bytes
=> Checkpointer will start to count the number of bytes.

The list goes on like this and if we don't have [write | read |
extend]_bytes, this effect will be multiplied.

> > > > I think these are the three proposals for handling WAL reads:
> > > >
> > > > 1) setting op_bytes to 1 and the number of reads is the number of bytes
> > > > 2) setting op_bytes to XLOG_BLCKSZ and the number of reads is the
> > > > number of calls to pg_pread() or similar
> > > > 3) setting op_bytes to NULL and the number of reads is the number of
> > > > calls to pg_pread() or similar
> > >
> > > 3) could be a number of bytes, actually.
> >
> > One important point is that we can't change only reads, if we decide
> > to count the number of bytes for the reads; writes and extends should
> > be counted as a number of bytes as well.
>
> Yes, that is true.
>
> > > > Looking at the patch, I think it is still doing 2.
> > >
> > > The patch disables stats for the WAL receiver, while the startup
> > > process reads WAL with XLOG_BLCKSZ, so yeah that's 2) with a trick to
> > > discard the variable case.
> > >
> > > > For an unpopular idea: we could add separate [IOOp]_bytes columns for
> > > > all those IOOps for which it would be relevant. It kind of stinks but
> > > > it would give us the freedom to document exactly what a single IOOp
> > > > means for each combination of BackendType, IOContext, IOObject, and
> > > > IOOp (as relevant) and still have an accurate number in the *bytes
> > > > columns. Everyone will probably hate us if we do that, though.
> > > > Especially because having bytes for the existing IOObjects is an
> > > > existing feature.
> > >
> > > An issue I have with this one is that having multiple tuples for
> > > each (object,context) if they have multiple op_bytes leads to
> > > potentially a lot of bloat in the view. That would be up to 8k extra
> > > tuples just for the sake of op_byte's variability.
> >
> > Yes, that doesn't seem applicable to me.
>
> My suggestion (again not sure it is a good idea) was actually that we
> remove op_bytes and add "write_bytes", "read_bytes", and
> "extend_bytes". AFAICT, this would add columns not rows. In this
> schema, read bytes for wal receiver could be counted in one way and
> writes in another. We could document that, for wal receiver, the reads
> are not always done in units of the same size, so the read_bytes /
> reads could be thought of as an average size of read.

That looks like one of the best options to me. I suggested something
similar upthread and Michael's answer was:

> But then you'd lose the possibility to analyze correlations between
> the size and the number of the operations, which is something that
> matters for more complex I/O scenarios. This does not need to be
> tackled in this patch, which is useful on its own, though I am really
> wondering if this is required for the recent work done by Thomas.
> Perhaps Andres, Thomas or Melanie could comment on that?

> Even if we made a separate view for WAL I/O stats, we would still have
> this issue of variable sized I/O vs block sized I/O and would probably
> end up solving it with separate columns for the number of bytes and
> number of operations.

Yes, I think it is more about flexibility and not changing the already
published pg_stat_io view.

> > > > A separate question: suppose [1] goes in (to read WAL from WAL buffers
> > > > directly). Now, WAL reads are not from permanent storage anymore. Are
> > > > we only tracking permanent storage I/O in pg_stat_io? I also had this
> > > > question for some of the WAL receiver functions. Should we track any
> > > > I/O other than permanent storage I/O? Or did I miss this being
> > > > addressed upthread?
> > >
> > > That's a good point. I guess that this should just be a different
> > > IOOp? That's not a IOOP_READ. A IOOP_HIT is also different.
> >
> > I think different IOContext rather than IOOp suits better for this.
>
> That makes sense to me.
>
> > > > In terms of what I/O we should track in a streaming/asynchronous
> > > > world, the options would be:
> > > >
> > > > 1) track read/write syscalls
> > > > 2) track blocks of BLCKSZ submitted to the kernel
> > > > 3) track bytes submitted to the kernel
> > > > 4) track merged I/Os (after doing any merging in the application)
> > > >
> > > > I think the debate was largely between 2 and 4. There was some
> > > > disagreement, but I think we landed on 2 because there is merging that
> > > > can happen at many levels in the storage stack (even the storage
> > > > controller). Distinguishing between whether or not Postgres submitted
> > > > 2 32k I/Os or 8 8k I/Os could be useful while you are developing AIO,
> > > > but I think it might be confusing for the Postgres user trying to
> > > > determine why their query is slow. It probably makes the most sense to
> > > > still track in block size.
> > > >
> > > > No matter what solution we pick, you should get a correct number if
> > > > you multiply op_bytes by an IOOp (assuming nothing is NULL). Or,
> > > > rather, there should be some way of getting an accurate number in
> > > > bytes of the amount of a particular kind of I/O that has been done.
> > >
> > > Yeah, coming back to op_bytes = -1/NULL as a tweak to mean that reads,
> > > writes or extends are counted as bytes, because we don't have a fixed
> > > operation size for some (object,context) cases.
> >
> > Can't we use 2 and 3 together? For example, use 3 for the IOOBJECT_WAL
> > IOs and 2 for the other IOs.
>
> We can do this. One concern I have is that much of WAL I/O is done in
> XLOG_BLCKSZ, so it feels kind of odd for all WAL I/O to appear as if
> it is being done in random chunks of bytes. We anticipated other
> uniformly non-block-based I/O types where having 1 in op_bytes would
> be meaningful. I didn't realize at the time that there would be
> variable-sized and block-sized I/O mixed together for the same backend
> type, io object, and io context.

Correct. What is the lowest level that can use two different options?
I mean, could we use 3 for the WAL receiver, IOOP_READ, IOOBJECT_WAL,
IOCONTEXT_NORMAL and the 2 for the rest?

--
Regards,
Nazir Bilal Yavuz
Microsoft

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-01-12 13:27:48 Re: Compile warnings in dbcommands.c building with meson
Previous Message Pavel Stehule 2024-01-12 12:35:24 Re: plpgsql memory leaks