Re: when the startup process doesn't (logging startup delays)

From: Nitin Jadhav <nitinjadhavpostgres(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: when the startup process doesn't (logging startup delays)
Date: 2021-09-13 15:02:54
Message-ID: CAMm1aWaVvuj6cvm0dNxfsN_EpmhE0n2vYud3pcPV75fc+JMCdg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> I think you're confusing discussions.
>
> Robert was talking about initdb/bootstrap/single, and I separately and
> independently asked about hot standbys. It seems like Robert and I agreed
> about the desired behavior and there was no further discussion.

Sorry for including 2 separate points into one.

> Looking over this version, I realized something I (or you) should have
> noticed sooner: you've added the RegisterTimeout call to
> InitPostgres(), but that's for things that are used by all backends,
> and this is only used by the startup process. So it seems to me that
> the right place is StartupProcessMain. That would have the further
> advantage of allowing startup_progress_timeout_handler to be made
> static. begin_startup_progress_phase() and
> startup_progress_timeout_has_expired() are the actual API functions
> though so they will need to remain extern.

The earlier discussion wrt this point is as follows.

> > I also agree that this is the better place to do it. Hence modified
> > the patch accordingly. The condition "!AmStartupProcess()" is added to
> > differentiate whether the call is done from a startup process or some
> > other process. Actually StartupXLOG() gets called in 2 places. one in
> > StartupProcessMain() and the other in InitPostgres(). As the logging
> > of the startup progress is required only during the startup process
> > and not in the other cases,
>
> The InitPostgres() case occurs when the server is started in bootstrap
> mode (during initdb) or in single-user mode (postgres --single). I do
> not see any reason why we shouldn't produce progress messages in at
> least the latter case. I suspect that someone who is in the rather
> desperate scenario of having to use single-user mode would really like
> to know how long the server is going to take to start up.
>
> Perhaps during initdb we don't want messages, but I'm not sure that we
> need to do anything about that here. None of the messages that the
> server normally produces show up when you run initdb, so I guess they
> are getting redirected to /dev/null or something.
>
> So I don't think that using AmStartupProcess() for this purpose is right.

This point is really confusing. As per the earlier discussion we
concluded to include RegisterTimeout() call even in case of
InitPostgres() to support logging in case of single-user mode. Now if
we remove the RegisterTimeout() call from InitPostgres(), we are not
going to support that anymore. Is this what you're trying to convey?
or we should add some checks and disable the code to RegisterTimeout()
if it is other than single-user mode. I have added a check if
(!IsPostmasterEnvironment) in the attached patch for this scenario.
Kindly confirm my understanding.

> > Should this feature distinguish between crash recovery and archive recovery on
> > a hot standby ? Otherwise the standby will display this all the time.
> >
> >2021-08-14 16:13:33.139 CDT startup[11741] LOG: redo in progress, elapsed time: 124.42 s, current LSN: 0/EEE2100
> >
> >If so, I think maybe you'd check !InArchiveRecovery (but until Robert finishes
> > cleanup of xlog.c variables, I can't say that with much confidence).
>
> Hmm. My inclination is to think that on an actual standby, you
> wouldn't want to get messages like this, but if you were doing a
> point-in-time-recovery, then you would. So I think maybe
> InArchiveRecovery is not the right thing. Perhaps StandbyMode?

I also feel that the log messages should be recorded in case of
point-in-time-recovery. So I have added a check if (!StandbyMode) and
verified the replication and point-in-time-recovery scenario.

> > Some doc comments:
>
> Thanks for the suggestions. I will take care in the next patch.

Fixed.

> Michael is right. You updated some of the units based on Robert's suggestion
> to use MS, but didn't update all of the corresponding parts of the patch.
> guc.c says that the units are in MS, which means that unqualified values are
> interpretted as such. But postgresql.conf.sample still says "seconds", and
> guc.c says the default value is "10", and you still do:
>
> + interval_in_ms = log_startup_progress_interval * 1000;
>
> I checked that this currently does not interpret the value as ms:
> |./tmp_install/usr/local/pgsql/bin/postgres -D src/test/regress/tmp_check/data/ -c log_startup_progress_interval=1
> |2021-09-07 06:28:58.694 CDT startup[18865] LOG: redo in progress, elapsed time: 1.00 s, current LSN: 0/E94ED88
> |2021-09-07 06:28:59.694 CDT startup[18865] LOG: redo in progress, elapsed time: 2.00 s, current LSN: 0/10808EE0
> |2021-09-07 06:29:00.694 CDT startup[18865] LOG: redo in progress, elapsed time: 3.00 s, current LSN: 0/126B8C80
>
> (Also, the GUC value is in the range 0..INT_MAX, so multiplying and storing to
> another int could overflow.)
>
> I think the convention is to for GUC global vars to be initialized with the
> same default as in guc.c, so both should be 10000, like:
>
> +int log_startup_progress_interval = 10 * 1000 /* 10sec */

Following is the discussion done wrt this point. Kindly refer and
share your thoughts.

> > > I suggest making the GUC GUC_UNIT_MS rather than GUC_UNIT_S, but
> > > expressing the default in postgresl.conf.sample as 10s rather than
> > > 10000ms. I tried values measured in milliseconds just for testing
> > > purposes and didn't initially understand why it wasn't working. I
> > > don't think there's any reason it can't work.
> >
> > As suggested, I have changed it to GUC_UNIT_MS and kept the default
> > value to 10s. I would like to know the reason why it can't be
> > GUC_UNIT_S as we are expressing the values in terms of seconds.
>
> I mean, it *could* be. There's just no advantage. Values in seconds
> will work correctly either way. But values in milliseconds will only
> work if it's GUC_UNIT_MS. It seems to me that it's better to make more
> things work rather than fewer.

Thanks & Regards,
Nitin Jadhav
On Tue, Sep 7, 2021 at 5:19 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
> On Tue, Sep 07, 2021 at 03:07:15PM +0530, Nitin Jadhav wrote:
> > > Looking over this version, I realized something I (or you) should have
> > > noticed sooner: you've added the RegisterTimeout call to
> > > InitPostgres(), but that's for things that are used by all backends,
> > > and this is only used by the startup process. So it seems to me that
> > > the right place is StartupProcessMain. That would have the further
> > > advantage of allowing startup_progress_timeout_handler to be made
> > > static. begin_startup_progress_phase() and
> > > startup_progress_timeout_has_expired() are the actual API functions
> > > though so they will need to remain extern.
> > >
> > > I agree with Robert that a standby server should not continuously show timing
> > > messages for WAL replay.
> >
> > The earlier discussion wrt this point is as follows.
>
> I think you're confusing discussions.
>
> Robert was talking about initdb/bootstrap/single, and I separately and
> independently asked about hot standbys. It seems like Robert and I agreed
> about the desired behavior and there was no further discussion.
>
> > > Honestly, I don't see why we should have
> > > a GUC for what's proposed here. A value too low would bloat the logs
> > > with entries that are not that meaningful. A value too large would
> > > just prevent access to some information wanted. Wouldn't it be better
> > > to just pick up a value like 10s or 20s?
>
> I don't think bloating logs is a issue for values > 10sec.
>
> You agreed that it's important to choose the "right" value, but I think that
> will vary between users. Some installations with large checkpoint_timeout
> might anticipate taking 15+min to perform recovery, but others might even have
> a strict requirement that recovery must not take more than (say) 10sec; someone
> might want to use this to verify that, or to optimize the slow parts of
> recovery, with an interval that someone else might not care about.
>
> > > + {"log_startup_progress_interval", PGC_SIGHUP, LOGGING_WHEN,
> > > + gettext_noop("Sets the time interval between each progress update "
> > > + "of the startup process."),
> > > + gettext_noop("0 logs all messages. -1 turns this feature off."),
> > > + GUC_UNIT_MS,
> |+ 10, -1, INT_MAX,
> > > The unit is incorrect here, as that would default to 10ms, contrary to
> > > what the documentation says about 10s.
> >
> > Kindly refer the previous few discussions wrt this point.
>
> You copied and pasted unrelated emails, which isn't helpful.
>
> Michael is right. You updated some of the units based on Robert's suggestion
> to use MS, but didn't update all of the corresponding parts of the patch.
> guc.c says that the units are in MS, which means that unqualified values are
> interpretted as such. But postgresql.conf.sample still says "seconds", and
> guc.c says the default value is "10", and you still do:
>
> + interval_in_ms = log_startup_progress_interval * 1000;
>
> I checked that this currently does not interpret the value as ms:
> |./tmp_install/usr/local/pgsql/bin/postgres -D src/test/regress/tmp_check/data/ -c log_startup_progress_interval=1
> |2021-09-07 06:28:58.694 CDT startup[18865] LOG: redo in progress, elapsed time: 1.00 s, current LSN: 0/E94ED88
> |2021-09-07 06:28:59.694 CDT startup[18865] LOG: redo in progress, elapsed time: 2.00 s, current LSN: 0/10808EE0
> |2021-09-07 06:29:00.694 CDT startup[18865] LOG: redo in progress, elapsed time: 3.00 s, current LSN: 0/126B8C80
>
> (Also, the GUC value is in the range 0..INT_MAX, so multiplying and storing to
> another int could overflow.)
>
> I think the convention is to for GUC global vars to be initialized with the
> same default as in guc.c, so both should be 10000, like:
>
> +int log_startup_progress_interval = 10 * 1000 /* 10sec */
>
> --
> Justin

Attachment Content-Type Size
v14-0001-startup-process-progress.patch application/octet-stream 14.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2021-09-13 15:09:20 Re: TAP test for recovery_end_command
Previous Message Dipesh Pandit 2021-09-13 14:48:23 Re: .ready and .done files considered harmful