Re: Add Information during standby recovery conflicts

From: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
To: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Add Information during standby recovery conflicts
Date: 2020-07-01 12:52:26
Message-ID: daeecc9d-567a-cad8-298a-6f12ed7cb665@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hey,

On 6/29/20 10:55 AM, Masahiko Sawada wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Thu, 18 Jun 2020 at 16:28, Drouvot, Bertrand <bdrouvot(at)amazon(dot)com> wrote:
>> Hi hackers,
>>
>> I've attached a patch to add information during standby recovery conflicts.
>> The motive behind is to be able to get some information:
>>
>> On the apply side
>> On the blocker(s) side
>>
>> Motivation:
>>
>> When a standby recovery conflict occurs it could be useful to get more information to be able to dive deep on the root cause and find a way to avoid/mitigate new occurrences.
> I think this is a good feature. Like log_lock_waits, it will help the
> users to investigate recovery conflict issues.
exactly, thanks for looking at it.
>
>> Adding this information would make the investigations easier, it could help answering questions like:
>>
>> On which LSN was the WAL apply blocked?
>> What was the purpose of the bocked WAL record?
>> On which relation (if any) was the blocked WAL record related to?
>> What was the blocker(s) doing?
>> When did the blocker(s) started their queries (if any)?
>> What was the blocker(s) waiting for? on which wait event?
>>
>> Technical context and proposal:
>>
>> There is 2 points in this patch:
>>
>> Add the information about the blocked WAL record. This is done in standby.c (ResolveRecoveryConflictWithVirtualXIDs, ResolveRecoveryConflictWithDatabase, StandbyTimeoutHandler)
> I think we already have the information about the WAL record being
> applied in errcontext.
right, but it won’t be displayed in case log_error_verbosity is set to
terse.
Or did you had in mind to try to avoid using the new
“current_replayed_xlog” in xlog.c?

>
> I wonder if we can show the recovery conflict information in the main
> LOG message, the blocker information in errdetail, and use errcontext
> with regard to WAL record information. For example:
>
> LOG: process 500 waiting for recovery conflict on snapshot
> DETAIL: conflicting transition id: 100, 200, 300
> CONTEXT: WAL redo at 0/3001970 for Heap2/CLEAN: remxid 506
Not sure at all if it would be possible to put all the information at
the same time.
For example in case of shared buffer pin conflict, the blocker is
currently known “only” during the RecoveryConflictInterrupt call (so
still not known yet when we can “already” report the blocked LSN
information).
It might also happen that the blocker(s) will never get interrupted (was
not blocking anymore before max_standby_streaming_delay has been
reached): then it would not be possible to display all the information
here (aka when it is interrupted) while we still want to be warned that
the WAL replay has been blocked.
>
>> Outcome Example:
>>
>> 2020-06-15 06:48:23.774 UTC [6971] LOG: wal record apply is blocked by 2 connection(s), reason: User query might have needed to see row versions that must be removed.
>> 2020-06-15 06:48:23.774 UTC [6971] LOG: blocked wal record rmgr: Heap2, lsn: 0/038E2678, received at: 2020-06-15 06:48:23.774098+00, desc: CLEAN, relation: rel 1663/13586/16652 fork main blk 0
>> 2020-06-15 06:48:54.773 UTC [7088] LOG: about to interrupt pid: 7088, backend_type: client backend, state: active, wait_event_type: Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:14.87672+00
> I'm concerned that the above information on the process who is about
> to be interrupted is very detailed but I'm not sure it will be helpful
> for the users. If the blocker is waiting on something lock, the
> information should be logged by log_lock_waits.

The blocker could also be in “idle in transaction”, say for example, on
the standby (with hot_standby_feedback set to off):

standby> begin;
standby> SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
standby> select * from bdt;

on the master:

master> update bdt set generate_series = 15;
master> vacuum verbose bdt;

would produce:

2020-07-01 09:18:55.256 UTC [32751] LOG:  about to interrupt pid: 32751,
backend_type: client backend, state: idle in transaction,
wait_event_type: Client, wait_event: ClientRead, query_start: 2020-07-01
09:18:17.390572+00
2020-07-01 09:18:55.256 UTC [32751] DETAIL:  statement: select * from bdt;

I think those information are useful to have (might get rid of
wait_event_type though: done in the new attached patch).

> Also the canceled
> backend will emit the ERROR log with the message "canceling statement
> due to conflict with recovery”, and its pid can be logged at the log
> prefix. The user can know who has been canceled by recovery conflicts
> and the query.

Right, we can also get rid of the pid and the query in the new log
message too (done in the new attached patch).

> Probably we need to consider having a time threshold (or boolean to
> turn on/off) to emit this information to the server logs. Otherwise,
> we will end up writing every conflict information, making the log
> dirty needlessly.

good point, I do agree (done in the new attached patch).

Bertrand

>
> Regards,
>
> --
> Masahiko Sawada http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
v1-0002-Add-Standby-Recovery-Conflicts-Info.patch text/plain 15.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message torikoshia 2020-07-01 13:15:08 Re: Creating a function for exposing memory usage of backend process
Previous Message Amul Sul 2020-07-01 12:52:25 Re: Cleanup - adjust the code crossing 80-column window limit