Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

From: Nitin Jadhav <nitinjadhavpostgres(at)gmail(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Date: 2022-03-02 11:15:20
Message-ID: CAMm1aWYWmqys2O7UekQ0=ZSWSpG9h2qoEW7w46yiWM0eO1Lu7g@mail.gmail.com
Lists: pgsql-hackers

Thanks for reviewing.

> > > I suggested upthread to store the starting timeline instead. This way you can
> > > deduce whether it's a restartpoint or a checkpoint, but you can also deduce
> > > other information, like what was the starting WAL.
> >
> > I don't understand why we need the timeline here to just determine
> > whether it's a restartpoint or checkpoint.
>
> I'm not saying it's necessary, I'm saying that for the same space usage we can
> store something a bit more useful. If no one cares about having the starting
> timeline available for no extra cost then sure, let's just store the kind
> directly.

Fixed.

> 2) Can't we just have these checks inside CASE-WHEN-THEN-ELSE blocks
> directly instead of new function pg_stat_get_progress_checkpoint_kind?
> + snprintf(ckpt_kind, MAXPGPATH, "%s%s%s%s%s%s%s%s%s",
> + (flags == 0) ? "unknown" : "",
> + (flags & CHECKPOINT_IS_SHUTDOWN) ? "shutdown " : "",
> + (flags & CHECKPOINT_END_OF_RECOVERY) ? "end-of-recovery " : "",
> + (flags & CHECKPOINT_IMMEDIATE) ? "immediate " : "",
> + (flags & CHECKPOINT_FORCE) ? "force " : "",
> + (flags & CHECKPOINT_WAIT) ? "wait " : "",
> + (flags & CHECKPOINT_CAUSE_XLOG) ? "wal " : "",
> + (flags & CHECKPOINT_CAUSE_TIME) ? "time " : "",
> + (flags & CHECKPOINT_FLUSH_ALL) ? "flush-all" : "");

Fixed.
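
For illustration, decoding the flag bits directly in the view's SQL, as
suggested, could look roughly like the sketch below. This is only a sketch:
the decimal constants are assumed to mirror the CHECKPOINT_* flag bits in
xlog.h (and would have to stay in sync with them), s.param2 stands in for
whichever progress parameter ends up carrying the flags, and the
'CHECKPOINT' command type is of course only recognized with the patch
applied.

-- Hedged sketch only: decode the checkpoint flag bits in SQL instead of
-- through a C helper. Constants are assumed to match CHECKPOINT_* in
-- xlog.h; the cause flags (wal/time) would follow the same pattern with
-- their respective bit values.
SELECT concat_ws(' ',
           CASE WHEN s.param2 = 0 THEN 'unknown' END,
           CASE WHEN (s.param2 & 1)  <> 0 THEN 'shutdown' END,        -- CHECKPOINT_IS_SHUTDOWN
           CASE WHEN (s.param2 & 2)  <> 0 THEN 'end-of-recovery' END, -- CHECKPOINT_END_OF_RECOVERY
           CASE WHEN (s.param2 & 4)  <> 0 THEN 'immediate' END,       -- CHECKPOINT_IMMEDIATE
           CASE WHEN (s.param2 & 8)  <> 0 THEN 'force' END,           -- CHECKPOINT_FORCE
           CASE WHEN (s.param2 & 16) <> 0 THEN 'flush-all' END,       -- CHECKPOINT_FLUSH_ALL
           CASE WHEN (s.param2 & 32) <> 0 THEN 'wait' END             -- CHECKPOINT_WAIT
       ) AS kind
FROM pg_stat_get_progress_info('CHECKPOINT') AS s;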
---

> 5) Do we need a special phase for this checkpoint operation? I'm not
> sure in which cases it will take a long time, but it looks like
> there's a wait loop here.
> vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
> if (nvxids > 0)
> {
> do
> {
> pg_usleep(10000L); /* wait for 10 msec */
> } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
> }

Yes. It is better to add a separate phase here.
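For reference, reporting that wait as its own phase would look roughly like
the following; the PROGRESS_CHECKPOINT_* constant names are assumed to be
the ones introduced by the patch and may differ from what v4 actually uses.

/* Hedged sketch: report the wait for backends delaying the checkpoint
 * as a separate phase before entering the polling loop. */
pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
                             PROGRESS_CHECKPOINT_PHASE_WAIT_VIRTUAL_XIDS);

vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
if (nvxids > 0)
{
    do
    {
        pg_usleep(10000L);      /* wait for 10 msec */
    } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
}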
---

> Also, how about special phases for SyncPostCheckpoint(),
> SyncPreCheckpoint(), InvalidateObsoleteReplicationSlots(),
> PreallocXlogFiles() (it currently pre-allocates only 1 WAL file, but
> it might be increase in future (?)), TruncateSUBTRANS()?

SyncPreCheckpoint() just increments a counter, and PreallocXlogFiles()
currently pre-allocates only 1 WAL file, so I feel there is no need to
add phases for these as of now. We can add them in the future if
necessary. I have added phases for SyncPostCheckpoint(),
InvalidateObsoleteReplicationSlots() and TruncateSUBTRANS().
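Each of those follows the same pattern as the other phases, for example
(again with assumed constant names, not necessarily the ones in v4):

/* Hedged sketch of one of the added phases; the constant name is an
 * assumption about what the patch defines. */
pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
                             PROGRESS_CHECKPOINT_PHASE_SYNC_POST_CHECKPOINT);
SyncPostCheckpoint();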
---

> 6) SLRU (Simple LRU) isn't a phase here, you can just say
> PROGRESS_CHECKPOINT_PHASE_PREDICATE_LOCK_PAGES.
> +
> + pgstat_progress_update_param(PROGRESS_CHECKPOINT_PHASE,
> + PROGRESS_CHECKPOINT_PHASE_SLRU_PAGES);
> CheckPointPredicate();
>
> And :s/checkpointing SLRU pages/checkpointing predicate lock pages
>+ WHEN 9 THEN 'checkpointing SLRU pages'

Fixed.
---

> 7) :s/PROGRESS_CHECKPOINT_PHASE_FILE_SYNC/PROGRESS_CHECKPOINT_PHASE_PROCESS_FILE_SYNC_REQUESTS

I feel PROGRESS_CHECKPOINT_PHASE_FILE_SYNC is a better option here as
it describes the purpose in fewer words.

> And :s/WHEN 11 THEN 'performing sync requests'/WHEN 11 THEN
> 'processing file sync requests'

Fixed.
---

> 8) :s/Finalizing/finalizing
> + WHEN 14 THEN 'Finalizing'

Fixed.
---

> 9) :s/checkpointing snapshots/checkpointing logical replication snapshot files
> + WHEN 3 THEN 'checkpointing snapshots'
> :s/checkpointing logical rewrite mappings/checkpointing logical
> replication rewrite mapping files
> + WHEN 4 THEN 'checkpointing logical rewrite mappings'

Fixed.
---

> 10) I'm not sure if it's discussed, how about adding the number of
> snapshot/mapping files so far the checkpoint has processed in file
> processing while loops of
> CheckPointSnapBuild/CheckPointLogicalRewriteHeap? Sometimes, there can
> be many logical snapshot or mapping files and users may be interested
> in knowing the so-far-processed-file-count.

I had thought about this while sharing the v1 patch and mentioned my
views upthread. I feel it would not give meaningful progress information
(it is more of a statistic than a progress indicator). Hence I have not
included it. Thoughts?

> > > As mentioned upthread, there can be multiple backends that request a
> > > checkpoint, so unless we want to store an array of pid we should store a number
> > > of backend that are waiting for a new checkpoint.
> >
> > Yeah, you are right. Let's not go that path and store an array of
> > pids. I don't see a strong use-case with the pid of the process
> > requesting checkpoint. If required, we can add it later once the
> > pg_stat_progress_checkpoint view gets in.
>
> I don't think that's really necessary to give the pid list.
>
> If you requested a new checkpoint, it doesn't matter if it's only your backend
> that triggered it, another backend or a few other dozen, the result will be the
> same and you have the information that the request has been seen. We could
> store just a bool for that but having a number instead also gives a bit more
> information and may allow you to detect some broken logic on your client code
> if it keeps increasing.

It's a good metric to show in the view, but the information is not
readily available; additional code would be required to count the
requests. Is it worth doing that? I feel this can be added later if
required.
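
For what it's worth, a rough sketch of what that extra code might look
like is below. Everything here is hypothetical: the num_requesters field
and the PROGRESS_CHECKPOINT_NUM_REQUESTERS column are assumed names and
are not part of the current patch.

/* Hypothetical sketch only: count checkpoint requestors so the view
 * could expose the number later. */

/* In RequestCheckpoint(), alongside setting the request flags: */
SpinLockAcquire(&CheckpointerShmem->ckpt_lck);
CheckpointerShmem->ckpt_flags |= flags;
CheckpointerShmem->num_requesters++;        /* hypothetical field */
SpinLockRelease(&CheckpointerShmem->ckpt_lck);

/* In the checkpointer, when picking up the request (read the counter
 * under the same spinlock in real code): */
pgstat_progress_update_param(PROGRESS_CHECKPOINT_NUM_REQUESTERS,
                             CheckpointerShmem->num_requesters);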

Please find the v4 patch attached and share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Tue, Mar 1, 2022 at 2:27 PM Nitin Jadhav
<nitinjadhavpostgres(at)gmail(dot)com> wrote:
>
> > > 3) Why do we need this extra calculation for start_lsn?
> > > Do you ever see a negative LSN or something here?
> > > + ('0/0'::pg_lsn + (
> > > + CASE
> > > + WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > + ELSE (0)::numeric
> > > + END + (s.param3)::numeric)) AS start_lsn,
> >
> > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > bigint type; thus the signed int64. This cast is OK as it wraps
> > around, but that means we have to take care to correctly display the
> > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > the special-casing for negative values.
>
> Yes. The extra calculation is required here as we are storing a uint64
> value in a variable of type int64. When we convert uint64 to int64,
> the bit pattern is preserved (so no data is lost). The high-order
> bit becomes the sign bit, and if the sign bit is set, both the sign and
> magnitude of the value change. To safely recover the original uint64
> value, we need the above calculation.
>
> > > 4) Can't you use timestamptz_in(to_char(s.param4)) instead of
> > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > the reasoning for having this function and it's named as *checkpoint*
> > > when it doesn't do anything specific to the checkpoint at all?
> >
> > I hadn't thought of using the types' inout functions, but it looks
> > like timestamp IO functions use a formatted timestring, which won't
> > work with the epoch-based timestamp stored in the view.
>
> There is a variation of to_timestamp() which takes UNIX epoch (float8)
> as an argument and converts it to timestamptz but we cannot directly
> call this function with S.param4.
>
> TimestampTz
> GetCurrentTimestamp(void)
> {
> TimestampTz result;
> struct timeval tp;
>
> gettimeofday(&tp, NULL);
>
> result = (TimestampTz) tp.tv_sec -
> ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
> result = (result * USECS_PER_SEC) + tp.tv_usec;
>
> return result;
> }
>
> S.param4 contains the output of the above function
> (GetCurrentTimestamp()) which returns Postgres epoch but the
> to_timestamp() expects UNIX epoch as input. So some calculation is
> required here. I feel the SQL 'to_timestamp(946684800 +
> (S.param4::float / 1000000)) AS start_time' works fine. The value
> '946684800' is equal to ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) *
> SECS_PER_DAY). I am not sure whether it is good practice to do it this
> way. Kindly share your thoughts.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Mon, Feb 28, 2022 at 6:40 PM Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
> >
> > On Sun, 27 Feb 2022 at 16:14, Bharath Rupireddy
> > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> > > 3) Why do we need this extra calculation for start_lsn?
> > > Do you ever see a negative LSN or something here?
> > > + ('0/0'::pg_lsn + (
> > > + CASE
> > > + WHEN (s.param3 < 0) THEN pow((2)::numeric, (64)::numeric)
> > > + ELSE (0)::numeric
> > > + END + (s.param3)::numeric)) AS start_lsn,
> >
> > Yes: LSN can take up all of an uint64; whereas the pgstat column is a
> > bigint type; thus the signed int64. This cast is OK as it wraps
> > around, but that means we have to take care to correctly display the
> > LSN when it is > 0x7FFF_FFFF_FFFF_FFFF; which is what we do here using
> > the special-casing for negative values.
> >
> > As to whether it is reasonable: Generating 16GB of wal every second
> > (2^34 bytes /sec) is probably not impossible (cpu <> memory bandwidth
> > has been > 20GB/sec for a while); and that leaves you 2^29 seconds of
> > database runtime; or about 17 years. Seeing that a cluster can be
> > `pg_upgrade`d (which doesn't reset cluster LSN) since PG 9.0 from at
> > least version PG 8.4.0 (2009) (and through pg_migrator, from 8.3.0),
> > we can assume that clusters hitting LSN=2^63 will be a reasonable
> > possibility within the next few years. As the lifespan of a PG release
> > is about 5 years, it doesn't seem impossible that there will be actual
> > clusters that are going to hit this naturally in the lifespan of PG15.
> >
> > It is also possible that someone fat-fingers pg_resetwal; and creates
> > a cluster with LSN >= 2^63; resulting in negative values in the
> > s.param3 field. Not likely, but we can force such situations; and as
> > such we should handle that gracefully.
> >
> > > 4) Can't you use timestamptz_in(to_char(s.param4)) instead of
> > > pg_stat_get_progress_checkpoint_start_time? I don't quite understand
> > > the reasoning for having this function and it's named as *checkpoint*
> > > when it doesn't do anything specific to the checkpoint at all?
> >
> > I hadn't thought of using the types' inout functions, but it looks
> > like timestamp IO functions use a formatted timestring, which won't
> > work with the epoch-based timestamp stored in the view.
> >
> > > Having 3 unnecessary functions that aren't useful to the users at all
> > > in proc.dat will simply eatup the function oids IMO. Hence, I suggest
> > > let's try to do without extra functions.
> >
> > I agree that (1) could be simplified, or at least fully expressed in
> > SQL without exposing too many internals. If we're fine with exposing
> > internals like flags and type layouts, then (2), and arguably (4), can
> > be expressed in SQL as well.
> >
> > -Matthias

Attachment Content-Type Size
v4-0001-pg_stat_progress_checkpoint-view.patch application/octet-stream 34.9 KB
