Re: [proposal] recovery_target "latest"

From: Grigory Smolkin <g(dot)smolkin(at)postgrespro(dot)ru>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: [proposal] recovery_target "latest"
Date: 2019-12-08 01:03:01
Message-ID: 83f0a519-4e2b-ab43-c6ed-cc3577a29fde@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/21/19 1:46 PM, Peter Eisentraut wrote:
> On 2019-11-08 05:00, Grigory Smolkin wrote:
>> Attached new patch revision, now end of available WAL is defined as the
>> fact of missing required WAL.
>> In case of standby, the end of WAL is defined as 2 consecutive switches
>> of WAL source, that didn`t provided requested record.
>> In case of streaming standby, each switch of WAL source is forced after
>> 3 failed attempts to get new data from walreceiver.
>>
>> All constants are taken off the top of my head and serves as proof of
>> concept.
>
> Well, this is now changing the meaning of the patch quite a bit. I'm
> on board with making the existing default behavior explicit. (This is
> similar to how we added recovery_target_timeline = 'current' in PG12.)
> Still not a fan of the name yet, but that's trivial.
>
> If, however, you want to change the default behavior or introduce a
> new behavior, as you are suggesting here, that should be a separate
> discussion.

No, default behavior is not to be changed. As I previously mentioned, it
may break the backward compatibility for 3rd party backup and
replication applications.

> At Fri, 8 Nov 2019 16:08:47 +0300, Grigory Smolkin<g(dot)smolkin(at)postgrespro(dot)ru> wrote in
>> On 11/8/19 7:00 AM, Grigory Smolkin wrote:
>>> On 11/7/19 4:36 PM, Grigory Smolkin wrote:
>>>> I gave it some thought and now think that prohibiting recovery_target
>>>> 'latest' and standby was a bad idea.
>>>> All recovery_targets follow the same pattern of usage, so
>>>> recovery_target "latest" also must be capable of working in standby
>>>> mode.
>>>> All recovery_targets have a clear deterministic 'target' where
>>>> recovery should end.
>>>> In case of recovery_target "latest" this target is the end of
>>>> available WAL, therefore the end of available WAL must be more clearly
>>>> defined.
>>>> I will work on it.
>>>>
>>>> Thank you for a feedback.
>>> Attached new patch revision, now end of available WAL is defined as
>>> the fact of missing required WAL.
>>> In case of standby, the end of WAL is defined as 2 consecutive
>>> switches of WAL source, that didn`t provided requested record.
>>> In case of streaming standby, each switch of WAL source is forced
>>> after 3 failed attempts to get new data from walreceiver.
>>>
>>> All constants are taken off the top of my head and serves as proof of
>>> concept.
>>> TAP test is updated.
>>>
>> Attached new revision, it contains some minor refactoring.
> Thanks for the new patch. I found that it needs more than I thought,
> but it seems a bit too complicated and less stable.
>
> As the patch does, WaitForWALToBecomeAvailable needs to exit when
> avaiable sources are exhausted. However, we don't need to count
> failures to do that. It is enough that the function have two more exit
> point. When streaming timeout fires, and when we found that streaming
> is not set up when archive/wal failed.
>
> In my opinion, it is better that we have less dependency to global
> variables in such deep levels in a call hierachy. Such nformation can
> be stored in XLogPageReadPrivate.

Many thanks!
It looks much better that my approach with global variables.

I`ve tested it and have some thoughts/concerns:

1. Recovery should report the exact reason why it has been forced to
stop. In case of recovering to the end of WAL, standby promotion request
received during recovery could be mistaken for reaching the end of WAL
and reported as such. To avoid this, I think that reachedEndOfWal
variable should be introduced.

In attached patch it is added as a global variable, but maybe something
more clever may be devised. I was not sure that reachedEndOfWal could be
placed in XLogPageReadPrivate. Because we need to access it at the
higher level than ReadRecord(), and I was under impression placing it in
XLogPageReadPrivate could violate abstraction level of XLogPageReadPrivate.

2. During the testing, I`ve stumbled upon assertion failure in case of
recovering in standby mode to the the end of WAL coupled with
recovery_target_action as "promote", caused by the WAL source in state
machine not been changed after reaching the recovery target (script to
reproduce is attached):

2019-12-07 13:45:54.468 MSK [23067] LOG:  starting PostgreSQL 13devel on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 9.2.1 20190827 (Red Hat
9.2.1-1), 64-bit
2019-12-07 13:45:54.468 MSK [23067] LOG:  listening on IPv4 address
"127.0.0.1", port 15433
2019-12-07 13:45:54.470 MSK [23067] LOG:  listening on Unix socket
"/tmp/.s.PGSQL.15433"
2019-12-07 13:45:54.475 MSK [23068] LOG:  database system was
interrupted; last known up at 2019-12-07 13:45:49 MSK
cp: cannot stat '/home/gsmol/task/13_devel/archive/00000002.history': No
such file or directory
2019-12-07 13:45:54.602 MSK [23068] LOG:  entering standby mode
2019-12-07 13:45:54.614 MSK [23068] LOG:  restored log file
"000000010000000000000002" from archive
2019-12-07 13:45:54.679 MSK [23068] LOG:  redo starts at 0/2000028
2019-12-07 13:45:54.682 MSK [23068] LOG:  consistent recovery state
reached at 0/2000100
2019-12-07 13:45:54.682 MSK [23067] LOG:  database system is ready to
accept read only connections
2019-12-07 13:45:54.711 MSK [23068] LOG:  restored log file
"000000010000000000000003" from archive
2019-12-07 13:45:54.891 MSK [23068] LOG:  restored log file
"000000010000000000000004" from archive
2019-12-07 13:45:55.046 MSK [23068] LOG:  restored log file
"000000010000000000000005" from archive
2019-12-07 13:45:55.210 MSK [23068] LOG:  restored log file
"000000010000000000000006" from archive
2019-12-07 13:45:55.377 MSK [23068] LOG:  restored log file
"000000010000000000000007" from archive
2019-12-07 13:45:55.566 MSK [23068] LOG:  restored log file
"000000010000000000000008" from archive
2019-12-07 13:45:55.737 MSK [23068] LOG:  restored log file
"000000010000000000000009" from archive
cp: cannot stat
'/home/gsmol/task/13_devel/archive/00000001000000000000000A': No such
file or directory
2019-12-07 13:45:56.233 MSK [23083] LOG:  started streaming WAL from
primary at 0/A000000 on timeline 1
2019-12-07 13:45:56.365 MSK [23068] LOG:  recovery stopping after
reaching the end of available WAL
2019-12-07 13:45:56.365 MSK [23068] LOG:  redo done at 0/9FFC670
2019-12-07 13:45:56.365 MSK [23068] LOG:  last completed transaction was
at log time 2019-12-07 13:45:53.627746+03
2019-12-07 13:45:56.365 MSK [23083] FATAL:  terminating walreceiver
process due to administrator command
TRAP: FailedAssertion("StandbyMode", File: "xlog.c", Line: 12032)
postgres: startup   waiting for
00000001000000000000000A(ExceptionalCondition+0xa8)[0xa88b55]
postgres: startup   waiting for 00000001000000000000000A[0x573417]
postgres: startup   waiting for 00000001000000000000000A[0x572b68]
postgres: startup   waiting for 00000001000000000000000A[0x579066]
postgres: startup   waiting for
00000001000000000000000A(XLogReadRecord+0xe3)[0x5788ac]
postgres: startup   waiting for 00000001000000000000000A[0x5651f8]
postgres: startup   waiting for
00000001000000000000000A(StartupXLOG+0x23aa)[0x56b26e]
postgres: startup   waiting for
00000001000000000000000A(StartupProcessMain+0xc7)[0x8642a1]
postgres: startup   waiting for
00000001000000000000000A(AuxiliaryProcessMain+0x5b8)[0x5802ad]
postgres: startup   waiting for 00000001000000000000000A[0x863175]
postgres: startup   waiting for
00000001000000000000000A(PostmasterMain+0x1214)[0x85e0a4]
postgres: startup   waiting for 00000001000000000000000A[0x76f247]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f8867958f33]
postgres: startup   waiting for
00000001000000000000000A(_start+0x2e)[0x47afee]

#0  0x00007f886796ce75 in raise () from /lib64/libc.so.6
#1  0x00007f8867957895 in abort () from /lib64/libc.so.6
#2  0x0000000000a88b82 in ExceptionalCondition (conditionName=0xb24acc
"StandbyMode", errorType=0xb208a7 "FailedAssertion",
    fileName=0xb208a0 "xlog.c", lineNumber=12032) at assert.c:67
#3  0x0000000000573417 in WaitForWALToBecomeAvailable (RecPtr=151003136,
randAccess=true, fetching_ckpt=false, tliRecPtr=167757424,
    return_on_eow=true) at xlog.c:12032
#4  0x0000000000572b68 in XLogPageRead (xlogreader=0xf08ed8,
targetPagePtr=150994944, reqLen=8192, targetRecPtr=167757424,
    readBuf=0xf37d50 "\002\321\005") at xlog.c:11611
#5  0x0000000000579066 in ReadPageInternal (state=0xf08ed8,
pageptr=167755776, reqLen=1672) at xlogreader.c:579
#6  0x00000000005788ac in XLogReadRecord (state=0xf08ed8,
RecPtr=167757424, errormsg=0x7fff1f5cdeb8) at xlogreader.c:300
#7  0x00000000005651f8 in ReadRecord (xlogreader=0xf08ed8,
RecPtr=167757424, emode=22, fetching_ckpt=false) at xlog.c:4271
#8  0x000000000056b26e in StartupXLOG () at xlog.c:7373
#9  0x00000000008642a1 in StartupProcessMain () at startup.c:196
#10 0x00000000005802ad in AuxiliaryProcessMain (argc=2,
argv=0x7fff1f5cea80) at bootstrap.c:451
#11 0x0000000000863175 in StartChildProcess (type=StartupProcess) at
postmaster.c:5461
#12 0x000000000085e0a4 in PostmasterMain (argc=3, argv=0xf082b0) at
postmaster.c:1392
#13 0x000000000076f247 in main (argc=3, argv=0xf082b0) at main.c:210
(gdb) frame 8
#8  0x000000000056b26e in StartupXLOG () at xlog.c:7373
7373            record = ReadRecord(xlogreader, LastRec, PANIC, false);
(gdb) p StandbyMode
$1 = false
(gdb) list
7368
7369            /*
7370             * Re-fetch the last valid or last applied record, so we
can identify the
7371             * exact endpoint of what we consider the valid portion
of WAL.
7372             */
7373            record = ReadRecord(xlogreader, LastRec, PANIC, false);
7374            EndOfLog = EndRecPtr;
7375
7376            /*
7377             * EndOfLogTLI is the TLI in the filename of the XLOG
segment containing

Both issues are fixed in the new patch version.
Any review and thoughts on the matters would be much appreciated.

>
> I think The doc needs to exiplain on the difference between default
> and latest.
Sure, I will work on it.
>
> Please find the attached, which illustrates the first two points of
> the aboves.
>
> regards.

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
recovery_target_latest.sh application/x-shellscript 1.4 KB
target_latest_PoC_v2.patch text/x-patch 12.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Karl O. Pinc 2019-12-08 01:23:25 Re: proposal: minscale, rtrim, btrim functions for numeric
Previous Message Ranier Vilela 2019-12-07 23:42:38 RE: [Proposal] Level4 Warnings show many shadow vars