Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Magnus Hagander <magnus(at)hagander(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, lukas(at)fittl(dot)com, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Date: 2021-04-13 02:49:36
Message-ID: CAAKRu_atgf8pnWw4So5Sbi-YhD+Vff-Ac10VEjdvE3By-nM3uA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 26, 2020 at 11:21 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> At Sun, 26 Jan 2020 12:22:03 -0800, Andres Freund <andres(at)anarazel(dot)de> wrote in
> > On 2020-01-26 16:20:03 +0100, Magnus Hagander wrote:
> > > On Sun, Jan 26, 2020 at 1:44 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > > On 2020-01-25 15:43:41 +0100, Magnus Hagander wrote:
> > > > > On Fri, Jan 24, 2020 at 8:52 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > > > > Lastly, I don't understand what the point of sending fixed size stats,
> > > > > > like the stuff underlying pg_stat_bgwriter, through pgstats IPC. While
> > > > > > I don't like its architecture, we obviously need something like pgstat
> > > > > > to handle variable amounts of stats (database, table level etc
> > > > > > stats). But that doesn't at all apply to these types of global stats.
> > > > >
> > > > > That part has annoyed me as well a few times. +1 for just moving that
> > > > > into a global shared memory. Given that we don't really care about
> > > > > things being in sync between those different counters *or* if we lose
> > > > > a bit of data (which the stats collector is designed to do), we could
> > > > > even do that without a lock?
> > > >
> > > > I don't think we'd quite want to do it without any (single counter)
> > > > synchronization - high concurrency setups would be pretty likely to
> > > > lose values that way. I suspect the best would be to have a struct in
> > > > shared memory that contains the potential counters for each potential
> > > > process. And then sum them up when actually wanting the concrete
> > > > value. That way we avoid unnecessary contention, in contrast to having a
> > > > single shared memory value for each (which would just pingpong between
> > > > different sockets and store buffers). There's a few details like how
> > > > exactly to implement resetting the counters, but ...
> > >
> > > Right. Each process gets to do their own write, but still in shared
> > > memory. But do you need to lock them when reading them (for the
> > > summary)? That's the part where I figured you could just read and
> > > summarize them, and accept the possible loss.
> >
> > Oh, yea, I'd not lock for that. On nearly all machines aligned 64bit
> > integers can be read / written without a danger of torn values, and I
> > don't think we need perfect cross counter accuracy. To deal with the few
> > platforms without 64bit "single copy atomicity", we can just use
> > pg_atomic_read/write_u64. These days (e8fdbd58fe) they automatically
> > fall back to using locked operations for those platforms. So I don't
> > think there's actually a danger of loss.
> >
> > Obviously we could also use atomic ops to increment the value, but I'd
> > rather not add all those atomic operations, even if it's on uncontended
> > cachelines. It'd allow us to reset the backend values more easily by
> > just swapping in a 0, which we can't do if the backend increments
> > non-atomically. But I think we could instead just have one global "bias"
> > value to implement resets (by subtracting that from the summarized
> > value, and storing the current sum when resetting). Or use the new
> > global barrier to trigger a reset. Or something similar.
>
> Fixed or global stats are a suitable starting point for the
> shared-memory stats collector. In the case of buffers_*_write, the
> global stats entry for each process needs just 8 bytes, plus maybe an
> extra 8 bytes for the bias value. I'm not sure how many counters like
> this there are, but is that size of footprint acceptable? (Each
> backend already uses the same amount of local memory for pgstat use,
> though.)
>
> Anyway, I will try something like that, maybe by adding a member in
> PgBackendStatus and one global shared variable for the bias value.
>
> int64 st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
> + PgBackendStatsCounters counters;
> } PgBackendStatus;
>
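
To make the scheme in the quoted discussion concrete, here is a
minimal, self-contained sketch of the pattern: each process updates
only its own slot, readers sum the slots without taking a lock, and
resets subtract a "bias" recorded at reset time. C11 atomics stand in
for PostgreSQL's pg_atomic_read_u64/pg_atomic_write_u64, and all names
here are made up for illustration; this is not code from the attached
patch.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PROCS 8             /* illustrative; think MaxBackends + aux processes */

/* One slot per process; only the owning process ever writes its slot. */
typedef struct IOCounterSlot
{
    _Atomic uint64_t buffers_written;
} IOCounterSlot;

/* In the real thing these would live in shared memory. */
static IOCounterSlot slots[MAX_PROCS];
static _Atomic uint64_t buffers_written_bias;   /* subtracted to implement resets */

/* Called only by the owning process, so there is no write-write contention. */
static void
count_buffer_write(int myproc)
{
    /*
     * Relaxed load/store is enough here: we only need untorn 64-bit
     * accesses (what pg_atomic_read_u64/pg_atomic_write_u64 guarantee),
     * not an atomic read-modify-write, because nobody else writes this
     * slot.
     */
    uint64_t    cur = atomic_load_explicit(&slots[myproc].buffers_written,
                                           memory_order_relaxed);

    atomic_store_explicit(&slots[myproc].buffers_written, cur + 1,
                          memory_order_relaxed);
}

/* Any process may read: sum the slots lock-free, accepting slight skew. */
static uint64_t
sum_buffer_writes(void)
{
    uint64_t    total = 0;

    for (int i = 0; i < MAX_PROCS; i++)
        total += atomic_load_explicit(&slots[i].buffers_written,
                                      memory_order_relaxed);
    return total - atomic_load_explicit(&buffers_written_bias,
                                        memory_order_relaxed);
}

/* Reset without touching other processes' slots: just record a new bias. */
static void
reset_buffer_writes(void)
{
    uint64_t    total = 0;

    for (int i = 0; i < MAX_PROCS; i++)
        total += atomic_load_explicit(&slots[i].buffers_written,
                                      memory_order_relaxed);
    atomic_store_explicit(&buffers_written_bias, total, memory_order_relaxed);
}

int
main(void)
{
    count_buffer_write(0);
    count_buffer_write(3);
    printf("writes since reset: %llu\n", (unsigned long long) sum_buffer_writes());
    reset_buffer_writes();
    printf("writes since reset: %llu\n", (unsigned long long) sum_buffer_writes());
    return 0;
}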

So, I took a stab at implementing this in PgBackendStatus. The attached
patch is not quite on top of current master, so, alas, don't try to
apply it. I went to rebase today and realized I needed to make some
changes in light of e1025044cd4; however, I wanted to share this WIP so
that I can pose a few questions that I imagine will still be relevant
after I rewrite the patch.

I removed buffers_backend and buffers_backend_fsync from
pg_stat_bgwriter and created a new view which tracks:
- the number of shared buffers the checkpointer and bgwriter write out
- the number of shared buffers a regular backend is forced to flush
- the number of extends done by a regular backend through shared buffers
- the number of buffers flushed by a backend or autovacuum worker using
  a BufferAccessStrategy; had the strategy not been used, these writes
  could perhaps have been avoided if a clean shared buffer was available
- the number of fsyncs done by a backend which could have been done by
  the checkpointer if the sync request queue had not been full

This view currently only tracks writes and extends that go through
shared buffers, as well as fsyncs of shared buffers (which, AFAIK, are
currently the only things fsync'd through the SyncRequest machinery).
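
Purely as an illustration of the shape of the data behind such a view
(none of these names are from the patch), the per-backend state can be
a small set of counters that get rolled up into one row per backend
type when the view is read:

#include <stdint.h>
#include <stdio.h>

/* Which kind of process did the I/O; one view row per kind. */
typedef enum BackendKind
{
    BK_CHECKPOINTER,
    BK_BGWRITER,
    BK_AUTOVACUUM,
    BK_CLIENT_BACKEND,
    NUM_BACKEND_KINDS
} BackendKind;

/* Hypothetical counters corresponding to the columns described above. */
typedef struct BackendIOCounters
{
    uint64_t    shared_buffers_written;     /* buffers this process flushed */
    uint64_t    shared_buffers_extended;    /* extends through shared buffers */
    uint64_t    strategy_writes;            /* dirty strategy-ring writes */
    uint64_t    backend_fsyncs;             /* fsyncs done because the sync queue was full */
} BackendIOCounters;

static const char *backend_kind_names[NUM_BACKEND_KINDS] =
    {"checkpointer", "bgwriter", "autovacuum", "client backend"};

/* Print one "row" of the hypothetical view. */
static void
print_view_row(BackendKind kind, const BackendIOCounters *c)
{
    printf("%-15s writes=%llu extends=%llu strategy_writes=%llu fsyncs=%llu\n",
           backend_kind_names[kind],
           (unsigned long long) c->shared_buffers_written,
           (unsigned long long) c->shared_buffers_extended,
           (unsigned long long) c->strategy_writes,
           (unsigned long long) c->backend_fsyncs);
}

int
main(void)
{
    BackendIOCounters totals[NUM_BACKEND_KINDS] = {{0}};

    /* e.g. a client backend that had to flush a dirty buffer itself */
    totals[BK_CLIENT_BACKEND].shared_buffers_written++;

    for (int k = 0; k < NUM_BACKEND_KINDS; k++)
        print_view_row((BackendKind) k, &totals[k]);
    return 0;
}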

BufferAlloc() and SyncOneBuffer() are the main points at which the
tracking is done. I can definitely expand this, but I want to make sure
that we are tracking the right kind of information.

num_backend_writes and num_backend_fsync were intended (though they were
not accurate) to count buffers that backends ended up having to write
themselves and fsyncs that backends ended up having to do themselves --
work that could have been avoided with a different configuration (or, I
suppose, a different workload, different data, etc.). That is, they were
meant to tell you whether checkpointer and bgwriter were keeping up
and/or whether shared buffers was adequately sized.

In implementing this counting per backend, it is easy for all types of
backends to keep track of the number of writes, extends, fsyncs, and
strategy writes they are doing. So, as recommended upthread, I have
added columns in the view for the number of writes done by the
checkpointer, the bgwriter, and other backend types. Thus, this view
becomes more than just stats on "avoidable I/O done by backends".

So, my question is, does it make sense to track all extends -- those
done to extend the FSM and the visibility map, and those done when
creating a new relation or index? Is that information useful? If so, is
it different from the extends done through shared buffers? Should it be
tracked separately?

Also, if we care about all of the extends, then it seems a bit annoying
to pepper the counting all over the place when it really just needs to
be done in smgrextend() -- even though maybe a stats function doesn't
belong in that API.

Another question I have is, should the number of extends count every
single block extended, or should we try to track only the initiation of
a set of extends (all of the blocks added in RelationAddExtraBlocks(),
in this case)?

When it comes to fsync counting, I only count the fsyncs counted by the
previous code -- that is, fsyncs done by backends themselves when the
checkpointer's sync request queue was full.
I did the counting in the same place in the checkpointer code -- in
ForwardSyncRequest() -- partially because there did not seem to be
another good place to do it: register_dirty_segment() returns void (I
thought about having it return a bool to indicate whether it fsync'd
the segment itself or registered the fsync, which seemed alright, but
mdextend(), mdwrite(), etc. also return void), so there is no way to
propagate back up to the bufmgr the fact that the process had to do its
own fsync without mucking with the md.c API. And, since the
checkpointer is the one processing these sync requests anyway, it seems
okay to do the counting in the checkpointer code.
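
For context on where that counter sits, the shape of the fallback path
is roughly as follows. This is a simplified, self-contained model of
the ForwardSyncRequest()/register_dirty_segment() interaction, not the
real code; the queue, the names, and the counter are stand-ins:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SYNC_QUEUE_SIZE 4          /* stand-in for the checkpointer's request queue */

static int      sync_queue_len;
static uint64_t backend_fsyncs;    /* the counter the view would expose */

/* Model of ForwardSyncRequest(): true if the checkpointer took the request. */
static bool
forward_sync_request(void)
{
    if (sync_queue_len >= SYNC_QUEUE_SIZE)
    {
        /*
         * Queue full: the requesting backend will have to fsync the file
         * itself.  Counting here keeps the accounting next to that
         * decision, instead of threading a bool back up through the
         * md.c layer.
         */
        backend_fsyncs++;
        return false;
    }
    sync_queue_len++;
    return true;
}

/* Model of register_dirty_segment(): fall back to our own fsync on failure. */
static void
register_dirty_segment(void)
{
    if (!forward_sync_request())
    {
        /* pretend to fsync the segment ourselves here */
    }
}

int
main(void)
{
    for (int i = 0; i < 6; i++)
        register_dirty_segment();
    printf("backend_fsyncs=%llu (queue filled after %d requests)\n",
           (unsigned long long) backend_fsyncs, SYNC_QUEUE_SIZE);
    return 0;
}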

I'm not counting fsyncs that are "unavoidable" in the sense that they
couldn't be avoided by changing settings/workload etc. -- like those
done when building an index or creating, rewriting, or copying a table.
Is it useful to count these? It seems like counting them would make the
number of "avoidable fsyncs by backends" less useful. Also, should we
count how many fsyncs the checkpointer has done (I have to check
whether there is already a stat for that)? Is that useful in this
context?

Of course, this view, when grown, will begin to overlap with pg_statio,
which is another consideration. What is its identity? I would find
"avoidable I/O" -- whether avoidable entirely or avoidable for that
particular type of process -- to be a useful focus.

Or maybe it should have a more expansive mandate. Maybe it would be
useful to aggregate some of the info from pg_stat_statements at a higher
level -- for example, shared_blks_read counted across many statements
over a period of time, or in a context in which we expected the relation
to be in shared buffers, becomes potentially interesting.

As for the way I have recorded strategy writes -- it is quite inelegant,
but I wanted to make sure that I only counted a strategy write as one in
which the backend wrote out the dirty buffer from its strategy ring but
did not check whether there was any clean buffer in shared buffers more
generally (so, it is *potentially* an avoidable write). I'm not sure if
this distinction is useful to anyone. I haven't done enough with
BufferAccessStrategies to know what I'd want to know about them when
developing or using Postgres. However, if I don't need to be so careful,
the code will be much simpler (though I'm sure I can improve the code
regardless).

As for the implementation of the counters themselves, I appreciate that
it isn't very nice to have a bunch of random members in PgBackendStatus
to count all of these writes, extends, and fsyncs. I considered whether
I could add params used for all command types to st_progress_param, but
I haven't looked into it yet. Alternatively, I could create an array
just for these kinds of stats in PgBackendStatus. Though, I imagine that
I should take a look at the changes that have been made recently to this
area and at the shared memory stats patch.
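
One possible shape for that kind of array, with entirely made-up names:
index the counters by an I/O-operation enum instead of giving each one
its own member, which keeps the counting, reporting, and (eventually)
reset code generic as more counters are added:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical set of tracked I/O operations; extend as counters are added. */
typedef enum IOOpType
{
    IOOP_WRITE,
    IOOP_EXTEND,
    IOOP_FSYNC,
    IOOP_STRATEGY_WRITE,
    NUM_IOOP_TYPES
} IOOpType;

/* Stand-in for the per-backend status entry (cf. PgBackendStatus). */
typedef struct BackendIOStats
{
    uint64_t    io_counts[NUM_IOOP_TYPES];
} BackendIOStats;

static const char *ioop_names[NUM_IOOP_TYPES] =
    {"write", "extend", "fsync", "strategy write"};

static inline void
count_io_op(BackendIOStats *st, IOOpType op)
{
    st->io_counts[op]++;
}

/* Generic reporting loop: adding a counter needs no new reporting code. */
static void
report_io_stats(const BackendIOStats *st)
{
    for (int op = 0; op < NUM_IOOP_TYPES; op++)
        printf("%-15s %llu\n", ioop_names[op],
               (unsigned long long) st->io_counts[op]);
}

int
main(void)
{
    BackendIOStats st;

    memset(&st, 0, sizeof(st));
    count_io_op(&st, IOOP_WRITE);
    count_io_op(&st, IOOP_FSYNC);
    report_io_stats(&st);
    return 0;
}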

Oh, also, there should be a way to reset the stats, especially if we add
more extends and fsyncs that happen at the time of relation/index
creation. I, at least, would find it useful to see these numbers once
the database is at some kind of steady state.

Oh, and src/test/regress/sql/stats.sql will fail. Of course, I don't
intend to add that SELECT from the view to the regression tests; it was
just there to make sure the view was working.

-- Melanie

Attachment Content-Type Size
v1-0001-Add-system-view-tracking-shared-buffers-written.patch application/octet-stream 30.8 KB
