Re: when the startup process doesn't (logging startup delays)

From: Amul Sul <sulamul(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Nitin Jadhav <nitinjadhavpostgres(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: when the startup process doesn't (logging startup delays)
Date: 2021-10-13 13:05:33
Message-ID: CAAJ_b94GthU=2OpY1B9airmb-AZ288qhKr+ewxfQh3e_p=yRNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 29, 2021 at 11:10 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Sep 29, 2021 at 10:08 AM Nitin Jadhav
> <nitinjadhavpostgres(at)gmail(dot)com> wrote:
> > Sorry. There was a misunderstanding about this and for the patch
> > shared on September 27th, I had tested for the value '0' and observed
> > that no progress messages were getting logged, probably the time at
> > which 'enable_timeout_at' is getting called is past the 'next_timeout'
> > value. This behaviour is completely dependent on the system. Now added
> > an extra condition to disable the feature in case of '0' setting.
>
> Oh, interesting. I failed to consider that the behavior might vary
> from one system to another.
>
> I just noticed something else which I realize is the indirect result
> of my own suggestion but which doesn't actually look all that nice.
> You've now got a call to RegisterTimeout(STARTUP_PROGRESS_TIMEOUT,
> ...) in InitPostgres, guarded by ! IsPostmasterEnvironment, and then
> another one in StartupProcessMain(). I think that's so that the
> feature works in single-user mode, but that means that technically,
> we're not reporting on the progress of the startup process. We're
> reporting progress on the startup operations that are normally
> performed by the startup process. But that means that the
> documentation isn't quite accurate (because it mentions the startup
> process specifically) and that the placement of the code in startup.c
> is suspect (because that's specifically for the startup process) and
> that basically every instance of the word "process" in the patch is
> technically a little bit wrong. I'm not sure if that's a big enough
> problem to be worth worrying about or exactly what we ought to do
> about it, but it doesn't seem fantastic. At a minimum, I think we
> should probably try to eliminate as many references to the startup
> process as we can, and talk about startup progress or startup
> operations or something like that.
>
> + * Start timestamp of the operation that occur during startup process.
>
> This is not correct grammar - it would need to be "operations that
> occur" or "operation that occurs". But neither seems particularly
> clear about what the variable actually does. How about "Time at which
> the most recent startup operation started"?
>
> + * Indicates the timestamp at which the timer was supposed to expire.
>
> "Indicates" can be deleted, but also I think it would be better to
> state the purpose of the timer i.e. "Timestamp at which the next
> startup progress message should be logged."
>
> + enable_timeout_at(STARTUP_PROGRESS_TIMEOUT, next_timeout);
> + scheduled_startup_progress_timeout = next_timeout;
> + startup_progress_timer_expired = false;
>
> I think you should set startup_progress_timer_expired to false before
> calling enable_timeout_at. Otherwise there's a race condition. It's
> unlikely that the timer could expire and the interrupt handler fire
> before we reach startup_progress_timer_expired = false, but it seems
> like there's no reason to take a chance.
>
> + * Calculates the timestamp at which the next timer should expire and enables
>
> So in some places you have verbs with an "s" on the end, like here,
> and in other places without, like in the next example. In "telegraph
> style" comments like this, this implicit subject is "it" or "this,"
> but you don't write that. However you write the rest of the sentence
> as if it were there. So this should say "calculate" and "enable"
> rather than "calculates" and "enables".
>
> + * Schedule a wakeup call for logging the progress of startup process.
>
> This isn't really an accurate description, I think. It's not
> scheduling anything, and I don't know what a "wakeup call" is anyway.
> "Set a flag indicating that it's time to log a progress report" might
> be better.
>
> + * Sets the start timestamp of the current operation and also enables the
>
> Set. enable.
>
> + * timeout for logging the progress of startup process.
>
> I think you could delete "for logging the progress of startup process"
> here; that seems clear enough, and this reads a bit awkwardly. If you
> want to keep something like this I wrote write "...enable the timeout
> so that the progress of the startup progress will be logged."
>
> + * the timer if it did, otheriwse return false.
>
> otherwise
>
> + * Begin the startup progress phase to report the progress of syncing
> + * data directory (syncfs).
>
> All of the comments that start with "Begin the startup progress phase"
> should instead start with "Begin startup progress phase".
>
> I think this could be condensed down to "Prepare to report progress
> syncing the data directory via syncfs", and likewise "Prepare to
> report progress of the pre-fsync phase", "Prepare to report progress
> resetting unlogged relations," etc.
>
> + gettext_noop("Sets the time interval between each progress update "
> + "of the startup process."),
>
> This is actually inaccurate. It's sort of the same problem I was
> worried about with respect to the documentation: it's NOT the interval
> between progress updates, because it applies separately to each
> operation. We need to say something that makes that clear, or at least
> that doesn't get overtly the opposite impression. It's hard to do that
> briefly, but maybe something like "Time between progress updates for
> long-running startup operations"?
>
> Whatever we use here could also be the comment for
> log_startup_progress_interval.
>
> + * Logs the startup progress message if the timer has expired.
>
> the -> a
>

In addition I have little concern about ereport_startup_progress() use:

+ if (!StandbyMode)
+ ereport_startup_progress("redo in progress,
elapsed time: %ld.%02d s, current LSN: %X/%X",
+ LSN_FORMAT_ARGS(ReadRecPtr));

The format string, its input, input count, and input position
is not properly aligned. The input for "elapsed time: %ld.%02d s" is
getting value implicitly inside ereport_startup_progress() which
could be confusing, usually, the input count should be the same as
expected in the format string in the same sequence that is expected to
appear in the log message.

I think the "elapsed time" part can be implicitly added to the error
message inside ereport_startup_progress() which is common to all
calls.

Regards,
Amul

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2021-10-13 13:16:37 Re: storing an explicit nonce
Previous Message osumi.takamichi@fujitsu.com 2021-10-13 12:58:34 RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns