Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>
Cc: Lukas Fittl <lukas(at)fittl(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Date: 2022-10-19 19:26:51
Message-ID: CAAKRu_agmNMyMbfffGegbgFpUUTL=k3U64n_v0gLQGW5DeDkhg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

v34 is attached.
I think the column names need discussion. Also, the docs need more work
(I added a lot of new content there). I could use feedback on the column
names and definitions and review/rephrasing ideas for the docs
additions.

On Mon, Oct 17, 2022 at 1:28 AM Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com> wrote:
>
> On Thu, Oct 13, 2022 at 10:29 AM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> > I think that it makes sense to count both the initial buffers added to
> > the ring and subsequent shared buffers added to the ring (either when
> > the current strategy buffer is pinned or in use or when a bulkread
> > rejects dirty strategy buffers in favor of new shared buffers) as
> > strategy clocksweeps because of how the statistic would be used.
> >
> > Clocksweeps give you an idea of how much of your working set is cached
> > (setting aside initially reading data into shared buffers when you are
> > warming up the db). You may use clocksweeps to determine if you need to
> > make shared buffers larger.
> >
> > Distinguishing strategy buffer clocksweeps from shared buffer
> > clocksweeps allows us to avoid enlarging shared buffers if most of the
> > clocksweeps are to bring in blocks for the strategy operation.
> >
> > However, I could see an argument that discounting strategy clocksweeps
> > done because the current strategy buffer is pinned makes the number of
> > shared buffer clocksweeps artificially low since those other queries
> > using the buffer would have suffered a cache miss were it not for the
> > strategy. And, in this case, you would take strategy clocksweeps
> > together with shared clocksweeps to make your decision. And if we
> > include buffers initially added to the strategy ring in the strategy
> > clocksweep statistic, this number may be off because those blocks may
> > not be needed in the main shared working set. But you won't know that
> > until you try to reuse the buffer and it is pinned. So, I think we don't
> > have a better option than counting initial buffers added to the ring as
> > strategy clocksweeps (as opposed to as reuses).
> >
> > So, in answer to your question, no, I cannot think of a scenario like
> > that.
>
> That analysis makes sense to me; thanks.

I have made some major changes in this area to make the columns more
useful. I have renamed and split "clocksweeps". It is now "evicted" and
"freelist acquired". This makes it clear when a block must be evicted
from a shared buffer must be and may help to identify misconfiguration
of shared buffers.

There is some nuance here that I tried to make clear in the docs.
"freelist acquired" in a shared context is straightforward.
"freelist acquired" in a strategy context is counted when a shared
buffer is added to the strategy ring (not when it is reused).

"freelist acquired" in the local buffer context is actually the initial
allocation of a local buffer (in contrast with reuse).

"evicted" in the shared IOContext is a block being evicted from a shared
buffer in order to reuse that buffer when not using a strategy.

"evicted" in a strategy IOContext is a block being evicted from
a shared buffer in order to add that shared buffer to the strategy ring.

This is in contrast with "reused" in a strategy IOContext which is when
an existing buffer in the strategy ring has a block evicted in order to
reuse that buffer in a strategy context.

"evicted" in a local IOContext is when an existing local buffer has a
block evicted in order to reuse that local buffer.

"freelist_acquired" is confusing for local buffers but I wanted to
distinguish between reuse/eviction of local buffers and initial
allocation. "freelist_acquired" seemed more fitting because there is a
clocksweep to find a local buffer and if it hasn't been allocated yet it
is allocated in a place similar to where shared buffers acquire a buffer
from the freelist. If I didn't count it here, I would need to make a new
column only for local buffers called "allocated" or something like that.

I chose not to call "evicted" "sb_evicted"
because then we would need a separate "local_evicted". I could instead
make "local_evicted", "sb_evicted", and rename "reused" to
"strat_evicted". If I did that we would end up with separate columns for
every IO Context describing behavior when a buffer is initially acquired
vs when it is reused.

It would look something like this:

shared buffers:
initial: freelist_acquired
reused: sb_evicted

local buffers:
initial: allocated
reused: local_evicted

strategy buffers:
initial: sb_evicted | freelist_acquired
reused: strat_evicted
replaced: sb_evicted | freelist_acquired

This seems not too bad at first, but if you consider that later we will
add other kinds of IO -- eg WAL IO or temporary file IO, we won't be
able to use these existing columns and will need to add even more
columns describing the exact behavior in those cases.

I wanted to devise a paradigm which allowed for reuse of columns across
IOContexts even if with slightly different meanings.

I have also added the columns "repossessed" and "rejected". "rejected"
is when a bulkread rejects a strategy buffer because it is dirty and
requires flush. Seeing a lot of rejections could indicate you need to
vacuum. "repossessed" is the number of times a strategy buffer was
pinned or in use by another backend and had to be removed from the
strategy ring and replaced with a new shared buffer. This gives you some
indication that there is contention on blocks recently used by a
strategy.

I've also added some descriptions to the docs of how these columns might
be used or what a large value in one of them may mean.

I haven't added tests for repossessed or rejected yet. I can add tests
for repossessed if we decide to keep it. Rejected is hard to write a
test for because we can't guarantee checkpointer won't clean up the
buffer before we can reject it

>
> > It also made me remember that I am incorrectly counting rejected buffers
> > as reused. I'm not sure if it is a good idea to subtract from reuses
> > when a buffer is rejected. Waiting until after it is rejected to count
> > the reuse will take some other code changes. Perhaps we could also count
> > rejections in the stats?
>
> I'm not sure what makes sense here.

I have fixed the counting of rejected and have made a new column
dedicated to rejected.

>
> > > From the io_context column description:
> > >
> > > + The autovacuum daemon, explicit <command>VACUUM</command>,
> > > explicit
> > > + <command>ANALYZE</command>, many bulk reads, and many bulk
> > > writes use a
> > > + fixed amount of memory, acquiring the equivalent number of
> > > shared
> > > + buffers and reusing them circularly to avoid occupying an
> > > undue portion
> > > + of the main shared buffer pool.
> > > + </para></entry>
> > >
> > > I don't understand how this is relevant to the io_context column.
> > > Could you expand on that, or am I just missing something obvious?
> > >
> >
> > I'm trying to explain why those other IO Contexts exist (bulkread,
> > bulkwrite, vacuum) and why they are separate from shared buffers.
> > Should I cut it altogether or preface it with something like: these are
> > counted separate from shared buffers because...?
>
> Oh I see. That makes sense; it just wasn't obvious to me this was
> talking about the last three values of io_context. I think a brief
> preface like that would be helpful (maybe explicitly with "these last
> three values", and I think "counted separately").

I've done this. Thanks for the suggested wording.

>
> > > + <row>
> > > + <entry role="catalog_table_entry"><para
> > > role="column_definition">
> > > + <structfield>extended</structfield> <type>bigint</type>
> > > + </para>
> > > + <para>
> > > + Extends of relations done by this
> > > <varname>backend_type</varname> in
> > > + order to write data in this <varname>io_context</varname>.
> > > + </para></entry>
> > > + </row>
> > >
> > > I understand what this is, but not why this is something I might want
> > > to know about.
> >
> > Unlike writes, backends largely have to do their own extends, so
> > separating this from writes lets us determine whether or not we need to
> > change checkpointer/bgwriter to be more aggressive using the writes
> > without the distraction of the extends. Should I mention this in the
> > docs? The other stats views don't seems to editorialize at all, and I
> > wasn't sure if this was an objective enough point to include in docs.
>
> Thanks for the clarification. Just to make sure I understand, you mean
> that if I see a high extended count, that may be interesting in terms
> of write activity, but I can't fix that by tuning--it's just the
> nature of my workload?

That is correct.

>
> > > That seems broadly reasonable, but pg_settings also has a 'unit'
> > > field, and in that view, unit is '8kB' on my system--i.e., it
> > > (presumably) reflects the block size. Is that something we should try
> > > to be consistent with (not sure if that's a good idea, but thought it
> > > was worth asking)?
> > >
> >
> > I think this idea is a good option. I am wondering if it would be clear
> > when mixed with non-block-oriented IO. Block-oriented IO would say 8kB
> > (or whatever the build-time value of a block was) and non-block-oriented
> > IO would say B or kB. The math would work out.
>
> Right, yeah. Although maybe that's a little confusing? When you
> originally added "unit", you had said:
>
> >The most correct thing to do to accommodate block-oriented and
> >non-block-oriented IO would be to specify all the values in bytes.
> >However, I would like this view to be usable visually (as opposed to
> >just in scripts and by tools). The only current value of unit is
> >"block_size" which could potentially be combined with the value of the
> >GUC to get bytes.
>
> Is this still usable visually if you have to compare values across
> units? I don't really have any great ideas here (and maybe this is
> still the best option), just pointing it out.
>
> > Looking at pg_settings now though, I am confused about
> > how the units for wal_buffers is 8kB but then the value of wal_buffers
> > when I show it in psql is "16MB"...
>
> You mean the difference between
>
> maciek=# select setting, unit from pg_settings where name = 'wal_buffers';
> setting | unit
> ---------+------
> 512 | 8kB
> (1 row)
>
> and
>
> maciek=# show wal_buffers;
> wal_buffers
> -------------
> 4MB
> (1 row)
>
> ?
>
> Poking around, I think it looks like that's due to
> convert_int_from_base_unit (indirectly called from SHOW /
> current_setting):
>
> /*
> * Convert an integer value in some base unit to a human-friendly
> unit.
> *
> * The output unit is chosen so that it's the greatest unit that can
> represent
> * the value without loss. For example, if the base unit is
> GUC_UNIT_KB, 1024
> * is converted to 1 MB, but 1025 is represented as 1025 kB.
> */

I've implemented a change using the same function pg_settings uses to
turn the build-time parameter BLCKSZ into 8kB (get_config_unit_name())
using the flag GUC_UNIT_BLOCKS. I am unsure if this is better or worse
than "block_size". I am feeling very conflicted about this column.

>
> > Though the units for the pg_stat_io view for block-oriented IO would be
> > the build-time values for block size, so it wouldn't line up exactly
> > with pg_settings.
>
> I don't follow--what would be the discrepancy?

I got confused.
You are right -- pg_settings does seem to use the build-time value of
BLCKSZ to derive this. I was confused because the description of
pg_settings says:

"The view pg_settings provides access to run-time parameters of the server."

- Melanie

Attachment Content-Type Size
v34-0001-Remove-BufferAccessStrategyData-current_was_in_r.patch text/x-patch 3.8 KB
v34-0003-Aggregate-IO-operation-stats-per-BackendType.patch text/x-patch 22.3 KB
v34-0002-Track-IO-operation-statistics-locally.patch text/x-patch 31.7 KB
v34-0004-Add-system-view-tracking-IO-ops-per-backend-type.patch text/x-patch 46.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-10-19 19:37:30 Re: shared memory stats ideas
Previous Message Andres Freund 2022-10-19 19:21:30 Re: problems with making relfilenodes 56-bits