Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Lukas Fittl <lukas(at)fittl(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Date: 2022-07-12 17:01:31
Message-ID: 20220712170131.fql5addovb2oiw7w@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
> > > I also realized that I am not differentiating between IOPATH_SHARED and
> > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> > > of buffer we are fsync'ing by the time we call register_dirty_segment(),
> > > I'm not sure how we would fix this.
> >
> > I think there scarcely happens flush for strategy-loaded buffers. If
> > that is sensible, IOOP_FSYNC would not make much sense for
> > IOPATH_STRATEGY.
> >
>
> Why would it be less likely for a backend to do its own fsync when
> flushing a dirty strategy buffer than a regular dirty shared buffer?

We really just don't expect a backend to do many segment fsyncs at
all. Otherwise there's something wrong with the forwarding mechanism.

It'd be different if we tracked WAL fsyncs more granularly - which would be
quite interesting - but that's something for another day^Wpatch.

> > > > Wonder if it's worth making the lock specific to the backend type?
> > > >
> > >
> > > I've added another Lock into PgStat_IOPathOps so that each BackendType
> > > can be locked separately. But, I've also kept the lock in
> > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> > > done easily.
> >
> > Looks fine about the lock separation.
> >
>
> Actually, I think it is not safe to use both of these locks. So for
> picking one method, it is probably better to go with the locks in
> PgStat_IOPathOps, it will be more efficient for flush (and not for
> fetching and resetting), so that is probably the way to go, right?

I think it's good to just use one kind of lock, and efficiency of snapshotting
/ resetting is nearly irrelevant. But I don't see why it's not safe to use
both kinds of locks?

> > Looks fine, but I think pgstat_flush_io_ops() need more comments like
> > other pgstat_flush_* functions.
> >
> > + for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> > + stats_shmem->stats[i].stat_reset_timestamp = ts;
> >
> > I'm not sure we need a separate reset timestamp for each backend type
> > but SLRU counter does the same thing..
> >
>
> Yes, I think for SLRU stats it is because you can reset individual SLRU
> stats. Also there is no wrapper data structure to put it in. I could
> keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
> operation stats at once, but I am thinking of getting rid of
> PgStatShared_BackendIOPathOps since it is not needed if I only keep the
> locks in PgStat_IOPathOps and make the global shared value an array of
> PgStat_IOPathOps.

I'm strongly against introducing super granular reset timestamps. I think that
was a mistake for SLRU stats, but we can't fix that as easily.

> Currently, strategy allocs count only reuses of a strategy buffer (not
> initial shared buffers which are added to the ring).
> strategy writes count only the writing out of dirty buffers which are
> already in the ring and are being reused.

That seems right to me.

> Alternatively, we could also count as strategy allocs all those buffers
> which are added to the ring and count as strategy writes all those
> shared buffers which are dirty when initially added to the ring.

I don't think that'd provide valuable information. The whole reason that
strategy writes are interesting is that they can lead to writing out data a
lot sooner than they would be written out without a strategy being used.

> Subject: [PATCH v24 2/3] Track IO operation statistics
>
> Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> location or type of IO done by a backend. For example, the checkpointer
> may write a shared buffer out. This would be counted as an IOOp write on
> an IOPath IOPATH_SHARED by BackendType "checkpointer".

I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
the file path. What about 'origin'?

> Each IOOp (alloc, fsync, extend, write) is counted per IOPath
> (direct, local, shared, or strategy) through a call to
> pgstat_count_io_op().

It seems we should track reads too - it's quite interesting to know whether
reads happened because of a strategy, for example. You do reference reads in a
later part of the commit message even :)

> The primary concern of these statistics is IO operations on data blocks
> during the course of normal database operations. IO done by, for
> example, the archiver or syslogger is not counted in these statistics.

We could extend this at a later stage, if we really want to. But I'm not sure
it's interesting or fully possible. E.g. the archiver's write are largely not
done by the archiver itself, but by a command (or module these days) it shells
out to.

> Note that this commit does not add code to increment IOPATH_DIRECT. A
> future patch adding wrappers for smgrwrite(), smgrextend(), and
> smgrimmedsync() would provide a good location to call
> pgstat_count_io_op() for unbuffered IO and avoid regressions for future
> users of these functions.

Hm. Perhaps we should defer introducing IOPATH_DIRECT for now then?

> Stats on IOOps for all IOPaths for a backend are initially accumulated
> locally.
>
> Later they are flushed to shared memory and accumulated with those from
> all other backends, exited and live.

Perhaps mention here that this later could be extended to make per-connection
stats visible?

> Some BackendTypes will not execute pgstat_report_stat() and thus must
> explicitly call pgstat_flush_io_ops() in order to flush their backend
> local IO operation statistics to shared memory.

Maybe add "flush ... during ongoing operation" or such? Because they'd all
flush at commit, IIRC.

> diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
> index 088556ab54..963b05321e 100644
> --- a/src/backend/bootstrap/bootstrap.c
> +++ b/src/backend/bootstrap/bootstrap.c
> @@ -33,6 +33,7 @@
> #include "miscadmin.h"
> #include "nodes/makefuncs.h"
> #include "pg_getopt.h"
> +#include "pgstat.h"
> #include "storage/bufmgr.h"
> #include "storage/bufpage.h"
> #include "storage/condition_variable.h"

Hm?

> diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
> index e926f8c27c..beb46dcb55 100644
> --- a/src/backend/postmaster/walwriter.c
> +++ b/src/backend/postmaster/walwriter.c
> @@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
> }
>
> if (ShutdownRequestPending)
> - {
> - /*
> - * Force reporting remaining WAL statistics at process exit.
> - *
> - * Since pgstat_report_wal is invoked with 'force' is false in main
> - * loop to avoid overloading the cumulative stats system, there may
> - * exist unreported stats counters for the WAL writer.
> - */
> - pgstat_report_wal(true);
> -
> proc_exit(0);
> - }
>
> /* Perform logging of memory contexts of this process */
> if (LogMemoryContextPending)

Let's do this in a separate commit and get it out of the way...

> @@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, BufferDesc *buf)
> * if this buffer should be written and re-used.
> */
> bool
> -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
> +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool *write_from_ring)
> {
> - /* We only do this in bulkread mode */
> +
> + /*
> + * We only reject reusing and writing out the strategy buffer this in
> + * bulkread mode.
> + */
> if (strategy->btype != BAS_BULKREAD)
> + {
> + /*
> + * If the buffer was from the ring and we are not rejecting it, consider it
> + * a write of a strategy buffer.
> + */
> + if (strategy->current_was_in_ring)
> + *write_from_ring = true;

Hm. This is set even if the buffer wasn't dirty? I guess we don't expect
StrategyRejectBuffer() to be called for clean buffers...

> /*
> diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
> index d9275611f0..d3963f59d0 100644
> --- a/src/backend/utils/activity/pgstat_database.c
> +++ b/src/backend/utils/activity/pgstat_database.c
> @@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
> }
>
> /*
> - * Called from autovacuum.c to report startup of an autovacuum process.
> + * Called from autovacuum.c to report startup of an autovacuum process and
> + * flush IO Operation statistics.
> * We are called before InitPostgres is done, so can't rely on MyDatabaseId;
> * the db OID must be passed in, instead.
> */
> @@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
> dbentry->stats.last_autovac_time = GetCurrentTimestamp();
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operation statistics
> + */
> + pgstat_flush_io_ops(false);
> }

Hm. I suspect this will always be zero - at this point we haven't connected to
a database, so there really can't have been much, if any, IO. I think I
suggested doing something here, but on a second look it really doesn't make
much sense.

Note that that's different from doing something in
pgstat_report_(vacuum|analyze) - clearly we've done something at that point.

> /*
> - * Report that the table was just vacuumed.
> + * Report that the table was just vacuumed and flush IO Operation statistics.
> */
> void
> pgstat_report_vacuum(Oid tableoid, bool shared,
> @@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
> }
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operations statistics
> + */
> + pgstat_flush_io_ops(false);
> }
>
> /*
> - * Report that the table was just analyzed.
> + * Report that the table was just analyzed and flush IO Operation statistics.
> *
> * Caller must provide new live- and dead-tuples estimates, as well as a
> * flag indicating whether to reset the changes_since_analyze counter.
> @@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
> }
>
> pgstat_unlock_entry(entry_ref);
> +
> + /*
> + * Report IO Operations statistics
> + */
> + pgstat_flush_io_ops(false);
> }

Think it'd be good to amend these comments to say that otherwise stats would
only get flushed after a multi-relatio autovacuum cycle is done / a
VACUUM/ANALYZE command processed all tables. Perhaps add the comment to one
of the two functions, and just reference it in the other place?

> --- a/src/include/utils/backend_status.h
> +++ b/src/include/utils/backend_status.h
> @@ -306,6 +306,40 @@ extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
> int buflen);
> extern uint64 pgstat_get_my_query_id(void);
>
> +/* Utility functions */
> +
> +/*
> + * When maintaining an array of information about all valid BackendTypes, in
> + * order to avoid wasting the 0th spot, use this helper to convert a valid
> + * BackendType to a valid location in the array (given that no spot is
> + * maintained for B_INVALID BackendType).
> + */
> +static inline int backend_type_get_idx(BackendType backend_type)
> +{
> + /*
> + * backend_type must be one of the valid backend types. If caller is
> + * maintaining backend information in an array that includes B_INVALID,
> + * this function is unnecessary.
> + */
> + Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
> + return backend_type - 1;
> +}

In function definitions (vs declarations) we put the 'static inline int' in a
separate line from the rest of the function signature.

> +/*
> + * When using a value from an array of information about all valid
> + * BackendTypes, add 1 to the index before using it as a BackendType to adjust
> + * for not maintaining a spot for B_INVALID BackendType.
> + */
> +static inline BackendType idx_get_backend_type(int idx)
> +{
> + int backend_type = idx + 1;
> + /*
> + * If the array includes a spot for B_INVALID BackendType this function is
> + * not required.

The comments around this seem a bit over the top, but I also don't mind them
much.

> Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
> writes, fsyncs, and extends) done through each IOPath (e.g. shared
> buffers, local buffers, unbuffered IO) by each type of backend.

Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
latter, except that we already have a bunch of views with that prefix.

> Some of these should always be zero. For example, checkpointer does not
> use a BufferAccessStrategy (currently), so the "strategy" IOPath for
> checkpointer will be 0 for all IOOps.

What do you think about returning NULL for the values that we except to never
be non-zero? Perhaps with an assert against non-zero values? Seems like it
might be helpful for understanding the view.

> +/*
> +* When adding a new column to the pg_stat_io view, add a new enum
> +* value here above IO_NUM_COLUMNS.
> +*/
> +enum
> +{
> + IO_COLUMN_BACKEND_TYPE,
> + IO_COLUMN_IO_PATH,
> + IO_COLUMN_ALLOCS,
> + IO_COLUMN_EXTENDS,
> + IO_COLUMN_FSYNCS,
> + IO_COLUMN_WRITES,
> + IO_COLUMN_RESET_TIME,
> + IO_NUM_COLUMNS,
> +};

We typedef pretty much every enum so the enum can be referenced without the
'enum' prefix. I'd do that here, even if we don't need it.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-07-12 17:09:44 Re: making relfilenodes 56 bits
Previous Message Melanie Plageman 2022-07-12 16:19:06 Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)