Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Lukas Fittl <lukas(at)fittl(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Date: 2022-07-13 17:14:52
Message-ID: CAAKRu_bM55pj3pPRW0nd_-paWHLRkOU69r816AeztBBa-N1HLA@mail.gmail.com
Lists: pgsql-hackers

The attached patch set is different enough from previous versions that
I've kept it as a new patch set.
Note that local buffer allocations are now correctly tracked.

On Tue, Jul 12, 2022 at 1:01 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
> > > > I also realized that I am not differentiating between IOPATH_SHARED and
> > > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> > > > of buffer we are fsync'ing by the time we call register_dirty_segment(),
> > > > I'm not sure how we would fix this.
> > >
> > > I think there scarcely happens flush for strategy-loaded buffers. If
> > > that is sensible, IOOP_FSYNC would not make much sense for
> > > IOPATH_STRATEGY.
> > >
> >
> > Why would it be less likely for a backend to do its own fsync when
> > flushing a dirty strategy buffer than a regular dirty shared buffer?
>
> We really just don't expect a backend to do many segment fsyncs at
> all. Otherwise there's something wrong with the forwarding mechanism.
>

When a dirty strategy buffer is written out, if the pendingOps sync
queue is full and the backend has to fsync the segment itself instead of
relying on the checkpointer, this shows up in the statistics as an
IOOP_FSYNC for IOPATH_SHARED, not IOPATH_STRATEGY.
So IOPATH_STRATEGY + IOOP_FSYNC will always be 0 for all BackendTypes.
Does this seem right?
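
To make the attribution problem concrete, here is a minimal sketch (hypothetical names and control flow, not code from the patch): by the time the segment-level fsync happens, the IOPath of the write that dirtied the segment is no longer known, so the count can only be charged to IOPATH_SHARED.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tags mirroring the patch's terminology. */
typedef enum { IOPATH_LOCAL, IOPATH_SHARED, IOPATH_STRATEGY, IOPATH_NUM } IOPath;
typedef enum { IOOP_FSYNC, IOOP_WRITE, IOOP_NUM } IOOp;

static long counts[IOPATH_NUM][IOOP_NUM];

static void
pgstat_count_io_op(IOPath path, IOOp op)
{
    counts[path][op]++;
}

/*
 * Sketch of the fallback: the segment-level fsync carries no record of
 * whether the dirtying write came from a strategy ring, so it is always
 * charged to IOPATH_SHARED.
 */
static void
register_dirty_segment(bool forward_queue_full)
{
    if (forward_queue_full)
        pgstat_count_io_op(IOPATH_SHARED, IOOP_FSYNC);
    /* else: request forwarded to the checkpointer, no backend fsync */
}
```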

>
> It'd be different if we tracked WAL fsyncs more granularly - which would be
> quite interesting - but that's something for another day^Wpatch.
>
>
I do have a question about this.
So, if we were to start tracking WAL IO would it fit within this
paradigm to have a new IOPATH_WAL for WAL or would it add a separate
dimension?

I was thinking that we might want to call this view pg_stat_io_data,
because we might want a separate view, pg_stat_io_wal, and then maybe
eventually convert pg_stat_slru to pg_stat_io_slru (or a subset of what
is in pg_stat_slru). And maybe later add pg_stat_io_[archiver/other].

Is pg_stat_io_data a good name that gives us flexibility to introduce
views which expose per-backend IO operation stats (maybe that goes in
pg_stat_activity, though [or maybe not, because it wouldn't include
exited backends?]) and per-query IO operation stats?

I would like to add roughly the same additional columns to all of
these during AIO development (basically the columns from iostat):
- average block size (will usually be 8kB for pg_stat_io_data but won't
necessarily for the others)
- IOPS/BW
- avg read/write wait time
- demand rate/completion rate
- merges
- maybe queue depth

And I would like to be able to see all of these per query, per backend,
per relation, per BackendType, per IOPath, per SLRU type, etc.

Basically, what I'm asking is
1) what can we name the view to enable these future stats to exist with
the least confusing/wordy view names?
2) will the current view layout and column titles work with minimal
changes for future stats extensions like what I mention above?

>
> > > > > Wonder if it's worth making the lock specific to the backend type?
> > > > >
> > > >
> > > > I've added another Lock into PgStat_IOPathOps so that each BackendType
> > > > can be locked separately. But, I've also kept the lock in
> > > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> > > > done easily.
> > >
> > > Looks fine about the lock separation.
> > >
> >
> > Actually, I think it is not safe to use both of these locks. So for
> > picking one method, it is probably better to go with the locks in
> > PgStat_IOPathOps, it will be more efficient for flush (and not for
> > fetching and resetting), so that is probably the way to go, right?
>
> I think it's good to just use one kind of lock, and efficiency of snapshotting
> / resetting is nearly irrelevant. But I don't see why it's not safe to use
> both kinds of locks?
>
>
The way I implemented it was not safe because I didn't use both locks
when resetting the stats.

In this new version of the patch, I've done the following: In shared
memory I've put the lock in PgStatShared_IOPathOps -- the data structure
which contains an array of PgStat_IOOpCounters for all IOOp types for
all IOPaths. Thus, different BackendType + IOPath combinations can be
updated concurrently without contending for the same lock.

To make this work, I made two versions of the PgStat_IOPathOps -- one
that has the lock, PgStatShared_IOPathOps, and one without,
PgStat_IOPathOps, so that I can persist it to the stats file without
writing and reading the LWLock and can have a local and snapshot version
of the data structure without the lock.

This also necessitated two versions of the data structure wrapping
PgStat_IOPathOps, PgStat_BackendIOPathOps, which contains an array with
a PgStat_IOPathOps for each BackendType, and
PgStatShared_BackendIOPathOps, containing an array of
PgStatShared_IOPathOps.
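
For reference, the shape described above looks roughly like the following sketch (array sizes and field names beyond those mentioned in this mail are assumed, and LWLock is stubbed out):

```c
#include <assert.h>
#include <string.h>

#define BACKEND_NUM_TYPES 13   /* assumed count of valid BackendTypes */
#define IOPATH_NUM_TYPES   3   /* local, shared, strategy */
#define IOOP_NUM_TYPES     5   /* alloc, extend, fsync, read, write */

typedef struct { int dummy; } LWLock;   /* stand-in for the real LWLock */

typedef struct PgStat_IOOpCounters
{
    long counts[IOOP_NUM_TYPES];
} PgStat_IOOpCounters;

/* Lock-free version: persisted to the stats file and used for the local
 * and snapshot copies. */
typedef struct PgStat_IOPathOps
{
    PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
} PgStat_IOPathOps;

/* Shared-memory version: adds a lock so each BackendType's slot can be
 * flushed without contending with other backend types. */
typedef struct PgStatShared_IOPathOps
{
    LWLock              lock;
    PgStat_IOOpCounters data[IOPATH_NUM_TYPES];
} PgStatShared_IOPathOps;

/* Wrappers holding one entry per BackendType. */
typedef struct PgStat_BackendIOPathOps
{
    PgStat_IOPathOps stats[BACKEND_NUM_TYPES];
} PgStat_BackendIOPathOps;

typedef struct PgStatShared_BackendIOPathOps
{
    PgStatShared_IOPathOps stats[BACKEND_NUM_TYPES];
} PgStatShared_BackendIOPathOps;

/* Copying the lock-free payload out of the shared version, as when
 * persisting to the stats file or taking a snapshot. */
static void
snapshot_one(const PgStatShared_IOPathOps *shared, PgStat_IOPathOps *out)
{
    memcpy(out->data, shared->data, sizeof(out->data));
}
```

The point of the two-struct split is visible in snapshot_one(): only the counter payload is copied, never the LWLock.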

>
> > > Looks fine, but I think pgstat_flush_io_ops() need more comments like
> > > other pgstat_flush_* functions.
> > >
> > > + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> > > + stats_shmem->stats[i].stat_reset_timestamp = ts;
> > >
> > > I'm not sure we need a separate reset timestamp for each backend type
> > > but SLRU counter does the same thing..
> > >
> >
> > Yes, I think for SLRU stats it is because you can reset individual SLRU
> > stats. Also there is no wrapper data structure to put it in. I could
> > keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
> > operation stats at once, but I am thinking of getting rid of
> > PgStatShared_BackendIOPathOps since it is not needed if I only keep the
> > locks in PgStat_IOPathOps and make the global shared value an array of
> > PgStat_IOPathOps.
>
> I'm strongly against introducing super granular reset timestamps. I think that
> was a mistake for SLRU stats, but we can't fix that as easily.
>
>
Since all stats in pg_stat_io must be reset at the same time, I've put
the reset timestamp in the PgStat[Shared]_BackendIOPathOps and removed
it from each PgStat[Shared]_IOPathOps.

>
> > Currently, strategy allocs count only reuses of a strategy buffer (not
> > initial shared buffers which are added to the ring).
> > strategy writes count only the writing out of dirty buffers which are
> > already in the ring and are being reused.
>
> That seems right to me.
>
>
> > Alternatively, we could also count as strategy allocs all those buffers
> > which are added to the ring and count as strategy writes all those
> > shared buffers which are dirty when initially added to the ring.
>
> I don't think that'd provide valuable information. The whole reason that
> strategy writes are interesting is that they can lead to writing out data a
> lot sooner than they would be written out without a strategy being used.
>
>
Then I agree that strategy writes should only count strategy buffers
that are written out in order to reuse the buffer (which is in lieu of
getting a new, potentially clean, shared buffer). This patch implements
that behavior.

However, for strategy allocs, it seems like we would want to count all
demand for buffers as part of a BufferAccessStrategy. So, that would
include allocating buffers to initially fill the ring, allocations of
new shared buffers after the ring was already full that are added to the
ring because all existing buffers in the ring are pinned, and buffers
already in the ring which are being reused.

This version of the patch only counts the third scenario as a strategy
allocation, but I think it would make more sense to count all three as
strategy allocs.

The downside of this behavior is that strategy allocs count different
scenarios than strategy writes, reads, and extends. But, I think that
this is okay.
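
A sketch of the accounting decision (an entirely hypothetical helper, just to pin down the three scenarios discussed above):

```c
#include <assert.h>
#include <stdbool.h>

static long strategy_allocs;

/*
 * The three alloc scenarios:
 *   1. a buffer allocated to initially fill the ring,
 *   2. a new shared buffer added because every ring buffer is pinned,
 *   3. reuse of a buffer already in the ring.
 * The current patch counts only case 3; the proposal is to count all three.
 */
typedef enum
{
    ALLOC_FILL_RING,        /* case 1 */
    ALLOC_RING_ALL_PINNED,  /* case 2 */
    ALLOC_REUSE_RING        /* case 3 */
} StrategyAllocKind;

static void
count_strategy_alloc(StrategyAllocKind kind, bool count_all_scenarios)
{
    if (kind == ALLOC_REUSE_RING || count_all_scenarios)
        strategy_allocs++;
}
```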

I'll clarify it in the docs once there is a decision.

Also, note that, as stated above, there will never be any strategy
fsyncs (that is, IOPATH_STRATEGY + IOOP_FSYNC will always be 0): the
code path starting with register_dirty_segment(), which ends with a
regular backend doing its own fsync when pendingOps is full, does not
know the current IOPath, and checkpointer does not use a
BufferAccessStrategy.

>
> > Subject: [PATCH v24 2/3] Track IO operation statistics
> >
> > Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> > location or type of IO done by a backend. For example, the checkpointer
> > may write a shared buffer out. This would be counted as an IOOp write on
> > an IOPath IOPATH_SHARED by BackendType "checkpointer".
>
> I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
> the file path. What about 'origin'?
>
>
Enough has changed in this version of the patch that I decided to defer
renaming until some of the other issues are resolved.

>
> > Each IOOp (alloc, fsync, extend, write) is counted per IOPath
> > (direct, local, shared, or strategy) through a call to
> > pgstat_count_io_op().
>
> It seems we should track reads too - it's quite interesting to know whether
> reads happened because of a strategy, for example. You do reference reads in a
> later part of the commit message even :)
>

I've added reads to what is counted.

>
> > The primary concern of these statistics is IO operations on data blocks
> > during the course of normal database operations. IO done by, for
> > example, the archiver or syslogger is not counted in these statistics.
>
> We could extend this at a later stage, if we really want to. But I'm not sure
> it's interesting or fully possible. E.g. the archiver's writes are largely not
> done by the archiver itself, but by a command (or module these days) it shells
> out to.
>

I've added note of this to some of the comments and the commit message.
I also omit rows for these BackendTypes from the view. See my later
comment in this email for more detail on that.

>
> > Note that this commit does not add code to increment IOPATH_DIRECT. A
> > future patch adding wrappers for smgrwrite(), smgrextend(), and
> > smgrimmedsync() would provide a good location to call
> > pgstat_count_io_op() for unbuffered IO and avoid regressions for future
> > users of these functions.
>
> Hm. Perhaps we should defer introducing IOPATH_DIRECT for now then?
>
>
It's gone.

>
> > Stats on IOOps for all IOPaths for a backend are initially accumulated
> > locally.
> >
> > Later they are flushed to shared memory and accumulated with those from
> > all other backends, exited and live.
>
> Perhaps mention here that this later could be extended to make per-connection
> stats visible?
>
>
Mentioned.

>
> > Some BackendTypes will not execute pgstat_report_stat() and thus must
> > explicitly call pgstat_flush_io_ops() in order to flush their backend
> > local IO operation statistics to shared memory.
>
> Maybe add "flush ... during ongoing operation" or such? Because they'd all
> flush at commit, IIRC.
>
>
Added.

>
> > diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
> > index 088556ab54..963b05321e 100644
> > --- a/src/backend/bootstrap/bootstrap.c
> > +++ b/src/backend/bootstrap/bootstrap.c
> > @@ -33,6 +33,7 @@
> > #include "miscadmin.h"
> > #include "nodes/makefuncs.h"
> > #include "pg_getopt.h"
> > +#include "pgstat.h"
> > #include "storage/bufmgr.h"
> > #include "storage/bufpage.h"
> > #include "storage/condition_variable.h"
>
> Hm?
>

Removed

>
> > diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
> > index e926f8c27c..beb46dcb55 100644
> > --- a/src/backend/postmaster/walwriter.c
> > +++ b/src/backend/postmaster/walwriter.c
> > @@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
> >  	}
> >
> >  	if (ShutdownRequestPending)
> > -	{
> > -		/*
> > -		 * Force reporting remaining WAL statistics at process exit.
> > -		 *
> > -		 * Since pgstat_report_wal is invoked with 'force' is false in main
> > -		 * loop to avoid overloading the cumulative stats system, there may
> > -		 * exist unreported stats counters for the WAL writer.
> > -		 */
> > -		pgstat_report_wal(true);
> > -
> >  		proc_exit(0);
> > -	}
> >
> >  	/* Perform logging of memory contexts of this process */
> >  	if (LogMemoryContextPending)
>
> Let's do this in a separate commit and get it out of the way...
>
>
I've put it in a separate commit.

>
> > @@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
> >  	 * if this buffer should be written and re-used.
> >  	 */
> >  bool
> > -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
> > +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
> >  {
> > -	/* We only do this in bulkread mode */
> > +
> > +	/*
> > +	 * We only reject reusing and writing out the strategy buffer in
> > +	 * bulkread mode.
> > +	 */
> >  	if (strategy->btype != BAS_BULKREAD)
> > +	{
> > +		/*
> > +		 * If the buffer was from the ring and we are not rejecting it,
> > +		 * consider it a write of a strategy buffer.
> > +		 */
> > +		if (strategy->current_was_in_ring)
> > +			*write_from_ring = true;
>
> Hm. This is set even if the buffer wasn't dirty? I guess we don't expect
> StrategyRejectBuffer() to be called for clean buffers...
>
>
Yes, we do not expect it to be called for clean buffers.
I've added a comment about this assumption.

>
> >  /*
> > diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
> > index d9275611f0..d3963f59d0 100644
> > --- a/src/backend/utils/activity/pgstat_database.c
> > +++ b/src/backend/utils/activity/pgstat_database.c
> > @@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
> >  }
> >
> >  /*
> > - * Called from autovacuum.c to report startup of an autovacuum process.
> > + * Called from autovacuum.c to report startup of an autovacuum process and
> > + * flush IO Operation statistics.
> >   * We are called before InitPostgres is done, so can't rely on MyDatabaseId;
> >   * the db OID must be passed in, instead.
> >   */
> > @@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
> >  	dbentry->stats.last_autovac_time = GetCurrentTimestamp();
> >
> >  	pgstat_unlock_entry(entry_ref);
> > +
> > +	/*
> > +	 * Report IO Operation statistics
> > +	 */
> > +	pgstat_flush_io_ops(false);
> >  }
>
> Hm. I suspect this will always be zero - at this point we haven't connected to
> a database, so there really can't have been much, if any, IO. I think I
> suggested doing something here, but on a second look it really doesn't make
> much sense.
>
> Note that that's different from doing something in
> pgstat_report_(vacuum|analyze) - clearly we've done something at that point.
>

I've removed this.

>
> >  /*
> > - * Report that the table was just vacuumed.
> > + * Report that the table was just vacuumed and flush IO Operation statistics.
> >   */
> >  void
> >  pgstat_report_vacuum(Oid tableoid, bool shared,
> > @@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
> >  	}
> >
> >  	pgstat_unlock_entry(entry_ref);
> > +
> > +	/*
> > +	 * Report IO Operations statistics
> > +	 */
> > +	pgstat_flush_io_ops(false);
> >  }
> >
> >  /*
> > - * Report that the table was just analyzed.
> > + * Report that the table was just analyzed and flush IO Operation statistics.
> >   *
> >   * Caller must provide new live- and dead-tuples estimates, as well as a
> >   * flag indicating whether to reset the changes_since_analyze counter.
> > @@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
> >  	}
> >
> >  	pgstat_unlock_entry(entry_ref);
> > +
> > +	/*
> > +	 * Report IO Operations statistics
> > +	 */
> > +	pgstat_flush_io_ops(false);
> >  }
>
> Think it'd be good to amend these comments to say that otherwise stats would
> only get flushed after a multi-relation autovacuum cycle is done / a
> VACUUM/ANALYZE command processed all tables. Perhaps add the comment to one
> of the two functions, and just reference it in the other place?
>

Done

>
>
> > --- a/src/include/utils/backend_status.h
> > +++ b/src/include/utils/backend_status.h
> > @@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
> >  													int buflen);
> >  extern uint64 pgstat_get_my_query_id(void);
> >
> > +/* Utility functions */
> > +
> > +/*
> > + * When maintaining an array of information about all valid BackendTypes, in
> > + * order to avoid wasting the 0th spot, use this helper to convert a valid
> > + * BackendType to a valid location in the array (given that no spot is
> > + * maintained for B_INVALID BackendType).
> > + */
> > +static inline int backend_type_get_idx(BackendType backend_type)
> > +{
> > +	/*
> > +	 * backend_type must be one of the valid backend types. If caller is
> > +	 * maintaining backend information in an array that includes B_INVALID,
> > +	 * this function is unnecessary.
> > +	 */
> > +	Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
> > +	return backend_type - 1;
> > +}
>
> In function definitions (vs declarations) we put the 'static inline int' in a
> separate line from the rest of the function signature.
>

Fixed.
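
For what it's worth, with the return type on its own line the fixed definitions would look roughly like this (the BackendType enum is trimmed to a stand-in here; only the formatting point matters):

```c
#include <assert.h>

/* Trimmed stand-in for the real BackendType enum. */
typedef enum BackendType
{
    B_INVALID = 0,
    B_BACKEND,
    B_CHECKPOINTER,
    BACKEND_SENTINEL
} BackendType;

#define BACKEND_NUM_TYPES (BACKEND_SENTINEL - 1)

/*
 * Convert a valid BackendType to an array index that skips B_INVALID.
 * Note the return type on its own line, per project style.
 */
static inline int
backend_type_get_idx(BackendType backend_type)
{
    assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
    return backend_type - 1;
}

/* Inverse: convert an array index back to a valid BackendType. */
static inline BackendType
idx_get_backend_type(int idx)
{
    return (BackendType) (idx + 1);
}
```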

>
> > +/*
> > + * When using a value from an array of information about all valid
> > + * BackendTypes, add 1 to the index before using it as a BackendType to adjust
> > + * for not maintaining a spot for B_INVALID BackendType.
> > + */
> > +static inline BackendType idx_get_backend_type(int idx)
> > +{
> > +	int backend_type = idx + 1;
> > +	/*
> > +	 * If the array includes a spot for B_INVALID BackendType this function is
> > +	 * not required.
>
> The comments around this seem a bit over the top, but I also don't mind them
> much.
>

Feel free to change them to something shorter. I couldn't think of
something I liked.

>
>
> > Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
> > writes, fsyncs, and extends) done through each IOPath (e.g. shared
> > buffers, local buffers, unbuffered IO) by each type of backend.
>
> Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
> latter, except that we already have a bunch of views with that prefix.
>
>
I have thoughts on this but thought it best deferred until after the _data
decision.

>
> > Some of these should always be zero. For example, checkpointer does not
> > use a BufferAccessStrategy (currently), so the "strategy" IOPath for
> > checkpointer will be 0 for all IOOps.
>
> What do you think about returning NULL for the values that we expect to never
> be non-zero? Perhaps with an assert against non-zero values? Seems like it
> might be helpful for understanding the view.
>

Yes, I like this idea.

Beyond just setting individual cells to NULL, if an entire row would be
NULL, I have now dropped it from the view.

So far, I have omitted from the view all rows for BackendTypes
B_ARCHIVER, B_LOGGER, and B_STARTUP.

Should I also omit rows for B_WAL_RECEIVER and B_WAL_WRITER for now?

I have also omitted rows for IOPATH_STRATEGY for all BackendTypes
*except* B_AUTOVAC_WORKER, B_BACKEND, B_STANDALONE_BACKEND, and
B_BG_WORKER.

Do these seem correct?

I think there are some BackendTypes which will never do IO Operations on
IOPATH_LOCAL but I am not sure which. Do you know which?

As for individual cells which should be NULL, so far what I have is:
- IOPATH_LOCAL + IOOP_FSYNC
I am sure there are others as well. Can you think of any?
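
One way to centralize the always-zero rule would be a predicate along these lines (hypothetical helper; only the combinations named in this mail are included):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { IOPATH_LOCAL, IOPATH_SHARED, IOPATH_STRATEGY } IOPath;
typedef enum { IOOP_ALLOC, IOOP_EXTEND, IOOP_FSYNC, IOOP_WRITE } IOOp;

/*
 * Which IOPath + IOOp cells are expected to stay zero and should
 * therefore be rendered as NULL in the view.
 */
static bool
io_op_always_zero(IOPath path, IOOp op)
{
    /* Local buffers are not fsynced through this mechanism. */
    if (path == IOPATH_LOCAL && op == IOOP_FSYNC)
        return true;
    /* Strategy fsyncs end up charged to IOPATH_SHARED, as discussed earlier. */
    if (path == IOPATH_STRATEGY && op == IOOP_FSYNC)
        return true;
    return false;
}
```

The view-building code could then both NULL the cell and assert the counter really is zero, as suggested above.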

>
> > +/*
> > +* When adding a new column to the pg_stat_io view, add a new enum
> > +* value here above IO_NUM_COLUMNS.
> > +*/
> > +enum
> > +{
> > + IO_COLUMN_BACKEND_TYPE,
> > + IO_COLUMN_IO_PATH,
> > + IO_COLUMN_ALLOCS,
> > + IO_COLUMN_EXTENDS,
> > + IO_COLUMN_FSYNCS,
> > + IO_COLUMN_WRITES,
> > + IO_COLUMN_RESET_TIME,
> > + IO_NUM_COLUMNS,
> > +};
>
> We typedef pretty much every enum so the enum can be referenced without the
> 'enum' prefix. I'd do that here, even if we don't need it.
>
>
So, I left it anonymous because I didn't want it to be used as a type
or referenced anywhere else.

I am interested to hear more about your SQL enums idea from upthread.

- Melanie

Attachment Content-Type Size
v25-0001-Add-BackendType-for-standalone-backends.patch text/x-patch 2.7 KB
v25-0002-Remove-unneeded-call-to-pgstat_report_wal.patch text/x-patch 1.1 KB
v25-0003-Track-IO-operation-statistics.patch text/x-patch 33.5 KB
v25-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch text/x-patch 18.2 KB
