Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Nitin Jadhav <nitinjadhavpostgres(at)gmail(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Date: 2022-03-03 15:00:10
Message-ID: CAE9k0PnJ42A+_B8uJ0CbQC_SvF9KNT+5LbgXdQOzV0_KuGJZmA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here are some of my review comments on the latest patch:

+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>type</structfield> <type>text</type>
+ </para>
+ <para>
+ Type of checkpoint. See <xref linkend="checkpoint-types"/>.
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>kind</structfield> <type>text</type>
+ </para>
+ <para>
+ Kind of checkpoint. See <xref linkend="checkpoint-kinds"/>.
+ </para></entry>
+ </row>

This looks a bit confusing. Two columns, one with the name "checkpoint
types" and another "checkpoint kinds". You can probably rename
checkpoint-kinds to checkpoint-flags and let the checkpoint-types be
as-it-is.

==

+ <entry><structname>pg_stat_progress_checkpoint</structname><indexterm><primary>pg_stat_progress_checkpoint</primary></indexterm></entry>
+ <entry>One row only, showing the progress of the checkpoint.

Let's make this message consistent with the already existing message
for pg_stat_wal_receiver. See description for pg_stat_wal_receiver
view in "Dynamic Statistics Views" table.

==

[local]:5432 ashu(at)postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time

I think the output in the kind column can be displayed as {immediate,
force, wait, requested, time}. By the way these are all checkpoint
flags so it is better to display it as checkpoint flags instead of
checkpoint kind as mentioned in one of my previous comments.

==

[local]:5432 ashu(at)postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | performing two phase checkpoint

This is the output I see when the checkpointer process has come out of
the two phase checkpoint and is currently writing checkpoint xlog
records and doing other stuff like updating control files etc. Is this
okay?

==

The output of log_checkpoint shows the number of buffers written is 3
whereas the output of pg_stat_progress_checkpoint shows it as 0. See
below:

2022-03-03 20:04:45.643 IST [22043] LOG: checkpoint complete: wrote 3
buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=24.652 s, sync=104.256 s, total=3889.625 s; sync files=2,
longest=0.011 s, average=0.008 s; distance=0 kB, estimate=0 kB

--

[local]:5432 ashu(at)postgres=# select * from pg_stat_progress_checkpoint;
-[ RECORD 1 ]-----+-------------------------------------
pid | 22043
type | checkpoint
kind | immediate force wait requested time
start_lsn | 0/14C60F8
start_time | 2022-03-03 18:59:56.018662+05:30
phase | finalizing
buffers_total | 0
buffers_processed | 0
buffers_written | 0

Any idea why this mismatch?

==

I think we can add a couple of more information to this view -
start_time for buffer write operation and start_time for buffer sync
operation. These are two very time consuming tasks in a checkpoint and
people would find it useful to know how much time is being taken by
the checkpoint in I/O operation phase. thoughts?

--
With Regards,
Ashutosh Sharma.

On Wed, Mar 2, 2022 at 4:45 PM Nitin Jadhav
<nitinjadhavpostgres(at)gmail(dot)com> wrote:
>
> Thanks for reviewing.
>
> > > > I suggested upthread to store the starting timeline instead. This way you can
> > > > deduce whether it's a restartpoint or a checkpoint, but you can also deduce
> > > > other information, like what was the starting WAL.
> > >
> > > I don't understand why we need the timeline here to just determine
> > > whether it's a restartpoint or checkpoint.
> >
> > I'm not saying it's necessary, I'm saying that for the same space usage we can
> > store something a bit more useful. If no one cares about having the starting
> > timeline available for no extra cost then sure, let's just store the kind
> > directly.
>
> Fixed.
>
> > 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
> > directly instead of new function pg_stat_get_progress_checkpoint_kind?
> > + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> > + (flags == 0) ? "unknown" : "",
> > + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> > + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> > + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> > + (flags & CHECKPOINT_FORCE) ? "force " : "",
> > + (flags & CHECKPOINT_WAIT) ? "wait " : "",
> > + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> > + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> > + (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");
>
> Fixed.
> ---
>
> > 5) Do we need a special phase for this checkpoint operation? I'm not
> > sure in which cases it will take a long time, but it looks like
> > there's a wait loop here.
> > vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
> > if (nvxids > 0)
> > {
> > do
> > {
> > pg_usleep(10000L); /* wait for 10 msec */
> > } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
> > }
>
> Yes. It is better to add a separate phase here.
> ---
>
> > Also, how about special phases for SyncPostCheckpoint(),
> > SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> > PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> > it might be increase in future (?)), TruncateSUBTRANS()?
>
> SyncPreCheckpoint() is just incrementing a counter and
> PreallocXlogFiles() currently pre-allocates only 1 WAL file. I feel
> there is no need to add any phases for these as of now. We can add in
> the future if necessary. Added phases for SyncPostCheckpoint(),
> InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
> ---
>
> > 6) SLRU (Simple LRU) isn't a phase here, you can just say
> > PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
> > +
> > + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> > + PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
> > CheckPointPredicate();
> >
> > And :s/checkpointing SLRU pages/checkpointing predicate lock pages
> >+ WHEN 9 THEN 'checkpointing SLRU pages'
>
> Fixed.
> ---
>
> > 7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS
>
> I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
> it describes the purpose in less words.
>
> > And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
> > 'processing file sync requests'
>
> Fixed.
> ---
>
> > 8) :s/Finalizing/finalizing
> > + WHEN 14 THEN 'Finalizing'
>
> Fixed.
> ---
>
> > 9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
> > + WHEN 3 THEN 'checkpointing snapshots'
> > :s/checkpointing logical rewrite mappings/checkpointing logical
> > replication rewrite mapping files
> > + WHEN 4 THEN 'checkpointing logical rewrite mappings'
>
> Fixed.
> ---
>
> > 10) I'm not sure if it's discussed, how about adding the number of
> > snapshot/mapping files so far the checkpoint has processed in file
> > processing while loops of
> > CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> > be many logical snapshot or mapping files and users may be interested
> > in knowing the so-far-processed-file-count.
>
> I had thought about this while sharing the v1 patch and mentioned my
> views upthread. I feel it won't give meaningful progress information
> (It can be treated as statistics). Hence not included. Thoughts?
>
> > > > As mentioned upthread, there can be multiple backends that request a
> > > > checkpoint, so unless we want to store an array of pid we should store a number
> > > > of backend that are waiting for a new checkpoint.
> > >
> > > Yeah, you are right. Let's not go that path and store an array of
> > > pids. I don't see a strong use-case with the pid of the process
> > > requesting checkpoint. If required, we can add it later once the
> > > pg_stat_progress_checkpoint view gets in.
> >
> > I don't think that's really necessary to give the pid list.
> >
> > If you requested a new checkpoint, it doesn't matter if it's only your backend
> > that triggered it, another backend or a few other dozen, the result will be the
> > same and you have the information that the request has been seen. We could
> > store just a bool for that but having a number instead also gives a bit more
> > information and may allow you to detect some broken logic on your client code
> > if it keeps increasing.
>
> It's a good metric to show in the view but the information is not
> readily available. Additional code is required to calculate the number
> of requests. Is it worth doing that? I feel this can be added later if
> required.
>
> Please find the v4 patch attached and share your thoughts.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
> <nitinjadhavpostgres(at)gmail(dot)com> wrote:
> >
> > > > 3) Why do we need this extra calculation for start_lsn?
> > > > Do you ever see a negative LSN or something here?
> > > > + ('0/0'::pg_lsn + (
> > > > + CASE
> > > > + WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > > + ELSE (0)::numeric
> > > > + END + (s.param3)::numeric)) AS start_lsn,
> > >
> > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > around, but that means we have to take care to correctly display the
> > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > the special-casing for negative values.
> >
> > Yes. The extra calculation is required here as we are storing unit64
> > value in the variable of type int64. When we convert uint64 to int64
> > then the bit pattern is preserved (so no data is lost). The high-order
> > bit becomes the sign bit and if the sign bit is set, both the sign and
> > magnitude of the value changes. To safely get the actual uint64 value
> > whatever was assigned, we need the above calculations.
> >
> > > > 4) Can't you use timestamptz_in(to_char(s.param4)) instead of
> > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > the reasoning for having this function and it's named as *checkpoint*
> > > > when it doesn't do anything specific to the checkpoint at all?
> > >
> > > I hadn't thought of using the types' inout functions, but it looks
> > > like timestamp IO functions use a formatted timestring, which won't
> > > work with the epoch-based timestamp stored in the view.
> >
> > There is a variation of to_timestamp() which takes UNIX epoch (float8)
> > as an argument and converts it to timestamptz but we cannot directly
> > call this function with S.param4.
> >
> > TimestampTz
> > GetCurrentTimestamp(void)
> > {
> > TimestampTz result;
> > struct timeval tp;
> >
> > gettimeofday(&tp, NULL);
> >
> > result = (TimestampTz) tp.tv_sec -
> > ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
> > result = (result * USECS_PER_SEC) + tp.tv_usec;
> >
> > return result;
> > }
> >
> > S.param4 contains the output of the above function
> > (GetCurrentTimestamp()) which returns Postgres epoch but the
> > to_timestamp() expects UNIX epoch as input. So some calculation is
> > required here. I feel the SQL 'to_timestamp(946684800 +
> > (S.param4::float / 1000000)) AS start_time' works fine. The value
> > '946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
> > SECS_PER_DAY). I am not sure whether it is good practice to use this
> > way. Kindly share your thoughts.
> >
> > Thanks & Regards,
> > Nitin Jadhav
> >
> > On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
> > <boekewurm+postgres(at)gmail(dot)com> wrote:
> > >
> > > On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
> > > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> > > > 3) Why do we need this extra calculation for start_lsn?
> > > > Do you ever see a negative LSN or something here?
> > > > + ('0/0'::pg_lsn + (
> > > > + CASE
> > > > + WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > > + ELSE (0)::numeric
> > > > + END + (s.param3)::numeric)) AS start_lsn,
> > >
> > > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > > bigint type; thus the signed int64. This cast is OK as it wraps
> > > around, but that means we have to take care to correctly display the
> > > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > > the special-casing for negative values.
> > >
> > > As to whether it is reasonable: Generating 16GB of wal every second
> > > (2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
> > > has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
> > > database runtime; or about 17 years. Seeing that a cluster can be
> > > `pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
> > > least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0)),
> > > we can assume that clusters hitting LSN=2^63 will be a reasonable
> > > possibility within the next few years. As the lifespan of a PG release
> > > is about 5 years, it doesn't seem impossible that there will be actual
> > > clusters that are going to hit this naturally in the lifespan of PG15.
> > >
> > > It is also possible that someone fat-fingers pg_resetwal; and creates
> > > a cluster with LSN >= 2^63; resulting in negative values in the
> > > s.param3 field. Not likely, but we can force such situations; and as
> > > such we should handle that gracefully.
> > >
> > > > 4) Can't you use timestamptz_in(to_char(s.param4)) instead of
> > > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > > the reasoning for having this function and it's named as *checkpoint*
> > > > when it doesn't do anything specific to the checkpoint at all?
> > >
> > > I hadn't thought of using the types' inout functions, but it looks
> > > like timestamp IO functions use a formatted timestring, which won't
> > > work with the epoch-based timestamp stored in the view.
> > >
> > > > Having 3 unnecessary functions that aren't useful to the users at all
> > > > in proc.dat will simply eatup the function oids IMO. Hence, I suggest
> > > > let's try to do without extra functions.
> > >
> > > I agree that (1) could be simplified, or at least fully expressed in
> > > SQL without exposing too many internals. If we're fine with exposing
> > > internals like flags and type layouts, then (2), and arguably (4), can
> > > be expressed in SQL as well.
> > >
> > > -Matthias

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Sharma 2022-03-03 15:02:19 Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
Previous Message Nitin Jadhav 2022-03-03 14:38:28 Re: pg_walinspect - a new extension to get raw WAL data and WAL stats