| From: | Soumya S Murali <soumyamurali(dot)work(at)gmail(dot)com> |
|---|---|
| To: | Álvaro Herrera <alvherre(at)kurilemu(dot)de> |
| Cc: | Michael Banck <mbanck(at)gmx(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org, melanieplageman(at)gmail(dot)com |
| Subject: | Re: [PATCH] Expose checkpoint timestamp and duration in pg_stat_checkpointer |
| Date: | 2025-11-26 10:15:06 |
| Message-ID: | CAMtXxw-Pv4Tr_5L2fLxoOOBKmsx9BSUoXZgEG6DTzrDn0mg7UA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, Nov 24, 2025 at 3:37 PM Álvaro Herrera <alvherre(at)kurilemu(dot)de> wrote:
>
> On 2025-Nov-24, Michael Banck wrote:
>
> > In general I doubt how much those gauges (as opposed to counters) only
> > pertaining to the last checkpoint are useful in pg_stat_checkpointer.
> > What would be the use case for those two values?
>
> I think it's useful to know how long checkpoint has to work. It's a bit
> lame to have only one duration (the last one), but at least with this
> arrangement you can have external monitoring software connect to the
> server, extract that value and save it somewhere else. Monitoring
> systems do this all the time, and we've been waiting for a better
> implementation to store monitoring data inside Postgres for years. I
> think we shouldn't block this proposal just because of this issue,
> because it can clearly be useful.
>
> However, I'm not sure I'm very interested in knowing only the duration
> of the checkpoint. I mean, much of the time the duration is going to be
> whatever fraction of the checkpoint timeout you have as
> checkpoint_completion_target, right? Which includes sleeps. So I think
> you really want two durations: one is the duration itself, and the other
> is what fraction of that did the checkpointer sleep in order to achieve
> that duration. So you know how much time checkpointer spent trying to
> get the operating system do stuff rather than just sit there waiting.
> We already have that data, kinda, in write_time and sync_time, but those
> are cumulative rather than just for the last one. (I guess you can have
> the monitoring system compute the deltas as it finds each new
> checkpoint.) I'm not sure how good this system is.
Thank you for the detailed thoughts. I agree that having only the last
checkpoint’s duration is limited, but it still gives monitoring tools
a concrete value they can sample and store over time, which is better
than relying only on counters and logs. I will look into whether the
total duration and the actual active write/sync time (as opposed to
sleep time) can be exposed separately and more clearly, as that seems
useful for deeper diagnosis.
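
For reference, the delta computation mentioned above for the existing
cumulative counters could look roughly like the sketch below. It is
only an illustration: the function name and the dict-based sample
format are hypothetical, and it assumes the monitoring tool has
already read write_time and sync_time (both in milliseconds) from
pg_stat_checkpointer at two points in time.

```python
def checkpoint_deltas(prev, curr):
    """Return the write/sync time (ms) accumulated between two samples.

    prev and curr are hypothetical dicts holding the cumulative
    write_time and sync_time values read from pg_stat_checkpointer.
    Returns None if a counter went backwards, which is treated here
    as a statistics reset between the two samples.
    """
    write_ms = curr["write_time"] - prev["write_time"]
    sync_ms = curr["sync_time"] - prev["sync_time"]
    if write_ms < 0 or sync_ms < 0:
        return None  # counters were reset; discard this interval
    return {"write_ms": write_ms, "sync_ms": sync_ms}


# Example with made-up sample values.
before = {"write_time": 1000.0, "sync_time": 50.0}
after = {"write_time": 1600.0, "sync_time": 80.0}
print(checkpoint_deltas(before, after))
```

This only attributes the delta to a single checkpoint if exactly one
checkpoint completed between the two samples, which is one reason a
per-checkpoint value in the view would be easier to consume.
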
> In the past, I looked at a couple of monitoring dashboards offered by
> cloud vendors, searching for anything valuable in terms of checkpoints.
> What I saw was very disappointing -- mostly just "how many checkpoints
> per minute", which is mostly flat zero with periodic spikes. Totally
> useless. Does anybody know if some vendor has good charts for this?
> Also, if we were to add this new proposed duration, how could these
> charts improve?
I will look into this in more depth. Will let you know if I find
something concrete.
Regards
Soumya