Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
Date: 2022-07-26 10:06:57
Message-ID: CALj2ACVyLjS4Ezhuf_Y4HATED+QT+OF0Y0uk73SOf2VDH=-GZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, May 13, 2022 at 6:41 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy
> <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> > Here's the rebased v9 patch.
>
> This seems like it has enormous overlap with the existing
> functionality that we have from log_startup_progress_interval.
>
> I think that facility is also better-designed than this one. It prints
> out a message based on elapsed time, whereas this patch prints out a
> message based progress through the WAL. That means that if WAL replay
> isn't actually advancing for some reason, you just won't get any log
> messages and you don't know whether it's advancing slowly or not at
> all or the server is just hung. With that facility you can distinguish
> those cases.
>
> Also, if for some reason we do think that amount of WAL replayed is
> the right metric, rather than time, why would we only allow high=1
> segment and low=128 segments, rather than say any number of MB or GB
> that the user would like to configure?
>
> I suggest that if log_startup_progress_interval doesn't meet your
> needs here, we should try to understand why not and maybe enhance it,
> instead of adding a separate facility.

After thinking for a while, I agree with Robert and others that we
could leverage the existing log_startup_progress_interval mechanism
for reporting which WAL file currently is being replayed. I added
current TLI (helps to construct the WAL file name from current LSN)
and current WAL file source (helps to know where the WAL files was
fetched from) to the existing "redo in progress, elapsed time:..."
message. This very well serves the purpose of identifying the issues
such as the restore command taking a lot of time (>
log_startup_progress_interval for instance), WAL replay rate on
standby or primary for long recoveries and so on.

However, ereport_startup_progress isn't enabled on standby to not let
it bloat the server logs. I believe the "redo in progress, elapsed
time:..." message can provide some important info/metric for standby
too and there's no way for the users to enable it on standbys today.
For instance, users can know how well the standby fares in replaying,
they can figure this out, by looking at two or more such messages. If
enabled, with default value of 10 sec for
log_startup_progress_interval, the standby can emit 8640 messages per
day which is too much. I'm not sure if we are okay to change the
default value of log_startup_progress_interval to say 1min or 5min so
that 1440 messages are emitted. In production environments, typically
users may or may not be interested if recovery takes just 10sec, but
they really are interested if it takes in the order of minutes.
Basically, I would like to enable "redo in progress, elapsed time:..."
message for standbys too.

Thoughts?

PSA v10 patch with enhanced "redo in progress, elapsed time:..."
message. Note that it's not a final patch though.

Regards,
Bharath Rupireddy.

Attachment Content-Type Size
v10-0001-Add-WAL-recovery-info-to-startup-progress-log-me.patch application/octet-stream 1.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2022-07-26 10:19:43 Re: Fast COPY FROM based on batch insert
Previous Message Dilip Kumar 2022-07-26 10:02:28 Re: Max compact as an FSM strategy