Add Information during standby recovery conflicts

From: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Add Information during standby recovery conflicts
Date: 2020-06-18 07:28:13
Message-ID: 9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

I've attached a patch to add information during standby recovery conflicts.
The motive behind is to be able to get some information:

* On the apply side
* On the blocker(s) side

_Motivation:_

When a standby recovery conflict occurs it could be useful to get more
information to be able to dive deep on the root cause and find a way to
avoid/mitigate new occurrences.

Adding this information would make the investigations easier, it could
help answering questions like:

* On which LSN was the WAL apply blocked?
* What was the purpose of the bocked WAL record?
* On which relation (if any) was the blocked WAL record related to?
* What was the blocker(s) doing?
* When did the blocker(s) started their queries (if any)?
* What was the blocker(s) waiting for? on which wait event?

_Technical context and proposal:_

There is 2 points in this patch:

* Add the information about the blocked WAL record. This is done in
standby.c (ResolveRecoveryConflictWithVirtualXIDs,
ResolveRecoveryConflictWithDatabase, StandbyTimeoutHandler)
* Add the information about the blocker(s). This is done in postgres.c
(RecoveryConflictInterrupt)

_Outcome Example:_

2020-06-15 06:48:23.774 UTC [6971] LOG: wal record apply is blocked by 2 connection(s), reason: User query might have needed to see row versions that must be removed.
2020-06-15 06:48:23.774 UTC [6971] LOG: blocked wal record rmgr: Heap2, lsn: 0/038E2678, received at: 2020-06-15 06:48:23.774098+00, desc: CLEAN, relation: rel 1663/13586/16652 fork main blk 0
2020-06-15 06:48:54.773 UTC [7088] LOG: about to interrupt pid: 7088, backend_type: client backend, state: active, wait_event_type: Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:14.87672+00
2020-06-15 06:48:54.773 UTC [7088] DETAIL: statement: select *, pg_sleep(120) from bdt;
2020-06-15 06:48:54.773 UTC [7088] STATEMENT: select *, pg_sleep(120) from bdt;
2020-06-15 06:48:54.773 UTC [7088] ERROR: canceling statement due to conflict with recovery
2020-06-15 06:48:54.773 UTC [7088] DETAIL: User query might have needed to see row versions that must be removed.
2020-06-15 06:48:54.773 UTC [7088] STATEMENT: select *, pg_sleep(120) from bdt;
2020-06-15 06:48:54.778 UTC [7037] LOG: about to interrupt pid: 7037, backend_type: client backend, state: active, wait_event_type: Timeout, wait_event: PgSleep, query_start: 2020-06-15 06:48:13.008427+00
2020-06-15 06:48:54.778 UTC [7037] DETAIL: statement: select *, pg_sleep(300) from bdt;
2020-06-15 06:48:54.778 UTC [7037] STATEMENT: select *, pg_sleep(300) from bdt;
2020-06-15 06:48:54.778 UTC [7037] ERROR: canceling statement due to conflict with recovery
2020-06-15 06:48:54.778 UTC [7037] DETAIL: User query might have needed to see row versions that must be removed.
2020-06-15 06:48:54.778 UTC [7037] STATEMENT: select *, pg_sleep(300) from bdt;

It takes care of the other conflicts reason too.

So, should a standby recovery conflict occurs, we could see:

* information about the blocked WAL record apply
* information about the blocker(s)

I will add this patch to the next commitfest. I look forward to your
feedback about the idea and/or implementation.

Regards,

Bertrand

Attachment Content-Type Size
v1-0001-Add-Standby-Recovery-Conflicts-Info.patch text/plain 12.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2020-06-18 07:32:59 Re: Review for GetWALAvailability()
Previous Message Peter Eisentraut 2020-06-18 07:13:35 Re: factorial of negative numbers