Re: Introduce a new view for checkpointer related stats

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce a new view for checkpointer related stats
Date: 2023-10-27 04:53:34
Message-ID: CALj2ACWekrk9D0Smb3oUc1rOsfp-ta7TE2Ez20=GYpPm97iWmQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 27, 2023 at 8:03 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> Hmm. As per the existing call of pgstat_count_slru_flush() in
> SimpleLruWriteAll(), routine called SimpleLruFlush() until ~13 and
> dee663f78439, an incrementation of 1 for slru_stats_idx refers to all
> the flushes for all the dirty data of this SLRU in a single pass.
>
> This addition actually means that we would now increment the counter
> for each sync request, changing its meaning. Perhaps there is an
> argument for changing the meaning of this parameter to be based on the
> number of flush requests completed, but if that were to happen it
> would be better to remove pgstat_count_slru_flush() in
> SimpleLruWriteAll(), I guess, or just aggregate this counter once,
> which would be cheaper.

Right. Interestingly, there are 2 SLRU flush related wait events
WAIT_EVENT_SLRU_SYNC ("Waiting for SLRU data to reach durable storage
following a page write") and WAIT_EVENT_SLRU_FLUSH_SYNC ("Waiting for
SLRU data to reach durable storage during a checkpoint or database
shutdown"). And, we're counting the SLRU flushes in two of these
places into one single stat variable. These two events look confusing
and may be useful if shown in an aggregated way.

A possible way is to move existing pgstat_count_slru_flush in
SimpleLruWriteAll closer to pg_fsync and WAIT_EVENT_SLRU_SYNC in
SlruPhysicalWritePage, remove WAIT_EVENT_SLRU_FLUSH_SYNC completely,
use WAIT_EVENT_SLRU_SYNC in SlruSyncFileTag and count the flushes in
SlruSyncFileTag. This aggregated way is much simpler IMV.

Another possible way is to have separate stat variables for each of
the SLRU flushes WAIT_EVENT_SLRU_SYNC and WAIT_EVENT_SLRU_FLUSH_SYNC
and expose them separately in pg_stat_slru. I don't like this
approach.

> v9-0003 is mostly a mechanical change, and it passes the eye test.

Thanks. Indeed it contains mechanical changes.

> I have spotted two nits.
>
> +CREATE VIEW pg_stat_checkpointer AS
> + SELECT
> + pg_stat_get_checkpointer_num_timed() AS num_timed,
> + pg_stat_get_checkpointer_num_requested() AS num_requested,
> + pg_stat_get_checkpointer_write_time() AS write_time,
> + pg_stat_get_checkpointer_sync_time() AS sync_time,
> + pg_stat_get_checkpointer_buffers_written() AS buffers_written,
> + pg_stat_get_checkpointer_stat_reset_time() AS stats_reset;
>
> Okay with this layer. I am wondering if others have opinions to
> share about these names for the attributes of pg_stat_checkpointer,
> though.

I think these column names are a good fit in this context as we can't
just call timed/requested/write/sync and having "checkpoint" makes the
column names long unnecessarily. FWIW, I see some of the user-exposed
field names starting with num_* (num_nonnulls, num_nulls, num_lwlocks,
num_transactions).

> - single row, containing global data for the cluster.
> + single row, containing data about the bgwriter of the cluster.
>
> "bgwriter" is used in one place of the docs, so perhaps "background
> writer" is better here?

+1. Changed in the attached v10-0001.

> The error message generated for an incorrect target needs to be
> updated in pg_stat_reset_shared():
> =# select pg_stat_reset_shared('checkpointe');
> ERROR: 22023: unrecognized reset target: "checkpointe"
> HINT: Target must be "archiver", "bgwriter", "io", "recovery_prefetch", or "wal".

+1. Changed in the attached v10-001. FWIW, having a test case in
stats.sql emitting this error message and hint would have helped here.
If okay, I can add one.

PS: I'll park the SLRU flush related patch aside until the approach is
finalized. I'm posting the pg_stat_checkpointer patch as v10-0001.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v10-0001-Introduce-a-new-view-for-checkpointer-related-st.patch application/x-patch 27.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2023-10-27 04:59:28 Re: A recent message added to pg_upgade
Previous Message Noah Misch 2023-10-27 04:44:04 Re: race condition in pg_class