Re: Add Information during standby recovery conflicts

From: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add Information during standby recovery conflicts
Date: 2020-07-11 12:55:57
Message-ID: 85c21e8e-4b12-9ff6-92bf-24fac2166632@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 7/7/20 9:43 AM, Masahiko Sawada wrote:
> Fully makes sense, the new patch version attached is now producing:
>> 2020-07-06 06:10:36.022 UTC [14035] LOG: waiting for recovery conflict
>> on snapshot
> How about adding the subject? that is, "recovery is waiting for
> recovery conflict on %s" or "recovery process <pid> is waiting for
> conflict on %s".

The subject is now added in the new attached patch (I did not include
the pid as it is part of the log prefix).

It now looks like:

2020-07-11 12:00:41.092 UTC [23217] LOG:  recovery is waiting for
recovery conflict on snapshot
2020-07-11 12:00:41.092 UTC [23217] DETAIL:  There is 1 blocking
connection(s).
2020-07-11 12:00:41.092 UTC [23217] CONTEXT:  WAL redo at 0/4A0A6BF0 for
Heap2/CLEAN: remxid 1128
        WAL record received at 2020-07-11 12:00:41.092231+00
        tbs 1663 db 13586 rel 16805, fork main, blkno 0
>
>> 2020-07-06 06:10:36.022 UTC [14035] DETAIL: WAL record received at
>> 2020-07-06 06:10:36.021963+00.
>> Tablespace/database/relation are 1663/13586/16672, fork is main
>> and block is 0.
>> There is 1 blocking connection(s).
> To follow the existing log message, perhaps this can be something like
> "WAL record received at %s, rel %u/%u/%u, fork %s, blkno %u. %d
> processes". But I'm not sure the errdetail is the best place to
> display the WAL information as I mentioned in the latter part of this
> email.

moved to the context and formatted the same way as the current
Standby/LOCK context.

> Ok. Let's keep this thread for the new attached patch that focus on the
>> recovery process waiting.
> Thank you for updating the patch!
>
> I've tested the latest patch.

Thank you for testing and reviewing!

> On recovery conflict on lock and on
> bufferpin, if max_standby_streaming_delay is disabled (set to -1), the
> logs don't appear even if log_recovery_conflicts is true.

Nice catch! it is fixed in the new attached patch (the log reporting has
been moved out of StandbyTimeoutHandler()).

>
> Here is random comments on the code:
>
> + recovery_conflict_main_message = psprintf("waiting for
> recovery conflict on %s",
> +
> get_procsignal_reason_desc(reason));
> :
> + ereport(LOG,
> + (errmsg("%s", recovery_conflict_main_message),
> + errdetail("%s\n" "There is %d blocking
> connection(s).", wal_record_detail_str, num_waitlist_entries)));
>
> It's not translation-support-friendly. I think the message "waiting
> for recovery conflict on %s" should be surrounded by _(). Or we can
> just put it to ereport as follows:
>
> ereport(LOG,
> (errmsg("waiting for recovery conflicts on %s",
> get_procsignal_reason_desc(reason))
> ...

changed in the new attached patch.

>
> ---
> + oldcontext = MemoryContextSwitchTo(ErrorContext);
> + econtext = error_context_stack;
> +
> + if (XLogRecGetBlockTag(econtext->arg, 0, &rnode, &forknum, &blknum))
>
> I don't think it's a good idea to rely on error_context_stack because
> other codes might set another error context before reaching here in
> the future.

right, changed in the new attached patch: this is now done in
rm_redo_error_callback() and using the XLogReaderState passed as argument.

>
> ---
> + if (XLogRecGetBlockTag(econtext->arg, 0, &rnode, &forknum, &blknum))
> + wal_record_detail_str = psprintf("WAL record received
> at %s.\nTablespace/database/relation are %u/%u/%u, fork is %s and
> block is %u.",
> + receipt_time_str,
> rnode.spcNode, rnode.dbNode, rnode.relNode,
> + forkNames[forknum],
> + blknum);
>
> There might be a block tag in block ids other than 0.

right, fixed in the new attached patch.

> I'm not sure the
> errdetail is the best place where we display WAL information.

moved to context in the new attached patch.

> For
> instance, we display both the relation oid and block number depending
> on RM as follows:
>
> 2020-07-07 15:50:30.733 JST [13344] LOG: waiting for recovery conflict on lock
> 2020-07-07 15:50:30.733 JST [13344] DETAIL: WAL record received at
> 2020-07-07 15:50:27.73378+09.
> There is 1 blocking connection(s).
> 2020-07-07 15:50:30.733 JST [13344] CONTEXT: WAL redo at 0/3000028
> for Standby/LOCK: xid 506 db 13586 rel 16384
>
> I wonder if we can display the details of redo WAL information by
> improving xlog_outdesc() or rm_redo_error_callback() so that it
> displays relfilenode, forknum, and block number. What do you think?

I think that fully make sense to move this to rm_redo_error_callback().

This is where the information is now been displayed in the new attached
patch.

>
> ---
> + /* display wal record information */
> + if (log_recovery_conflicts)
> + LogBlockedWalRecordInfo(num_waitlist_entries, reason);
>
> I'm concerned that we should log the recovery conflict information
> every time when a recovery conflict happens from the perspective of
> overheads and the amount of the logs. Can we logs that information
> after waiting for deadlock_timeouts secs like log_lock_waits or
> waiting for the fixed duration?

The new attached patch is now waiting for deadlock_timeout duration.

>
> ---
> @@ -609,6 +682,10 @@ StandbyTimeoutHandler(void)
> /* forget any pending STANDBY_DEADLOCK_TIMEOUT request */
> disable_timeout(STANDBY_DEADLOCK_TIMEOUT, false);
>
> + /* display wal record information */
> + if (log_recovery_conflicts)
> + LogBlockedWalRecordInfo(-1,
> PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
> +
> SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
> }
>
> ResolveRecoveryConflictWithBufferPin() which sets a timer to call
> StandbyTimeoutHandler() can be called multiple times even if the
> recovery is waiting for one buffer pin. I think we should avoid
> logging the same contents multiple times.

I do agree, only the first pass is now been logged.

>
> ---
> -
> + {
> + {"log_recovery_conflicts", PGC_SUSET, LOGGING_WHAT,
> + gettext_noop("Logs standby recovery conflicts."),
> + NULL
> + },
> + &log_recovery_conflicts,
> + true,
> + NULL, NULL, NULL
> + },
>
> Other logging parameters such as log_lock_waits is false by default. I
> think this parameter can also be false by default but is there any
> reason to enable it by default?

now set to false by default.

>
> Since this parameter applies only to the startup process, I think it
> should be PGC_SIGHUP.

changed that way.

>
> ---
> + /* display wal record information */
> + if (log_recovery_conflicts)
> + LogBlockedWalRecordInfo(CountDBBackends(dbid),
> PROCSIG_RECOVERY_CONFLICT_DATABASE);
>
> We log the recovery conflict into the server log but we don't update
> the process title to append "waiting". While discussing the process
> title update on recovery conflict, I got the review comment[1] that we
> don't need to update the process title because no wait occurs when
> recovery conflict with database happens. As the comment says, recovery
> is canceling the existing processes on the database being removed, but
> not waiting for something.

Ok, I keep reporting the conflict but changed the wording for this
particular case.

Thanks

Bertrand

Attachment Content-Type Size
v1-0002-Add-Standby-Recovery-Conflicts-BlockedOnly-Info.patch text/plain 11.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2020-07-11 13:16:43 Re: Include access method in listTables output
Previous Message Dilip Kumar 2020-07-11 12:37:28 INSERT INTO SELECT, Why Parallelism is not selected?