BUG #17928: Standby fails to decode WAL on termination of primary

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: exclusion(at)gmail(dot)com
Subject: BUG #17928: Standby fails to decode WAL on termination of primary
Date: 2023-05-11 08:00:01
Message-ID: 17928-aa92416a70ff44a2@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 17928
Logged by: Alexander Lakhin
Email address: exclusion(at)gmail(dot)com
PostgreSQL version: 15.2
Operating system: Ubuntu 22.04
Description:

When a primary server is stopped, standby might fail to replay WAL with
the memory allocation error, for example:
2023-05-11 08:42:38.254 MSK [1706654] LOG: recovery restart point at
7/12012E88
2023-05-11 08:42:38.254 MSK [1706654] DETAIL: Last completed transaction
was at log time 2023-05-11 08:42:38.245399+03.
2023-05-11 08:42:38.420 MSK [1706654] LOG: restartpoint starting: wal
2023-05-11 08:42:38.451 MSK [1725929] FATAL: could not receive data from
WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2023-05-11 08:42:38.468 MSK [1706656] FATAL: invalid memory alloc request
size 2021163525
2023-05-11 08:42:38.469 MSK [1706643] LOG: startup process (PID 1706656)
exited with exit code 1
2023-05-11 08:42:38.469 MSK [1706643] LOG: terminating any other active
server processes
2023-05-11 08:42:38.469 MSK [1706643] LOG: shutting down due to startup
process failure
2023-05-11 08:42:38.470 MSK [1706643] LOG: database system is shut down

The call stack of the error:
...
#6 0x000055cfabab8e97 in palloc_extended (size=2021163525, flags=2) at
mcxt.c:1143
#7 0x000055cfab456ae0 in XLogReadRecordAlloc (state=0x55cfacef4a08,
xl_tot_len=2021161080,
allow_oversized=true) at xlogreader.c:524
#8 0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
#9 0x000055cfab4575c0 in XLogReadAhead (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:960
#10 0x000055cfab45519d in XLogPrefetcherNextBlock
(pgsr_private=94350447959632, lsn=0x55cfacf4bc58)
at xlogprefetcher.c:496
#11 0x000055cfab454b1c in lrq_prefetch (lrq=0x55cfacf4bb08) at
xlogprefetcher.c:256
#12 0x000055cfab454cdf in lrq_complete_lsn (lrq=0x55cfacf4bb08,
lsn=2976923632) at xlogprefetcher.c:294
#13 0x000055cfab455df2 in XLogPrefetcherReadRecord
(prefetcher=0x55cfacef5e50, errmsg=0x7ffc2e410458)
at xlogprefetcher.c:1039
#14 0x000055cfab45f136 in ReadRecord (xlogprefetcher=0x55cfacef5e50,
emode=15, fetching_ckpt=false,
replayTLI=1) at xlogrecovery.c:3047
#15 0x000055cfab45c922 in PerformWalRecovery () at xlogrecovery.c:1754
#16 0x000055cfab448869 in StartupXLOG () at xlog.c:5300
#17 0x000055cfab7d151a in StartupProcessMain () at startup.c:267
#18 0x000055cfab7c3f80 in AuxiliaryProcessMain (auxtype=StartupProcess) at
auxprocess.c:141
#19 0x000055cfab7cfb4b in StartChildProcess (type=StartupProcess) at
postmaster.c:5429
#20 0x000055cfab7ca5c0 in PostmasterMain (argc=4, argv=0x55cfacef3e40) at
postmaster.c:1470
#21 0x000055cfab6be302 in main (argc=4, argv=0x55cfacef3e40) at main.c:202

(gdb) frame 8
#8 0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08,
nonblocking=false) at xlogreader.c:689
689 decoded = XLogReadRecordAlloc(state,
(gdb) print/x *record
$5 = {xl_tot_len = 0x78787878, xl_xid = 0x78787878, xl_prev =
0x7878787878787878, xl_info = 0x0,
xl_rmid = 0x10, xl_crc = 0x0}
(gdb) print/x state->NextRecPtr
$6 = 0xb1703ff0

hexdump -C .../standby_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56 00 00 00 00 00 00 00 00 (dot)(dot)(dot)(at)(dot)(dot)(dot)V(dot)(dot)(dot)(dot)(dot)(dot)(dot)(dot)
00703ff0 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
00704000 10 d1 05 00 01 00 00 00 00 40 70 30 00 00 00 00 (dot)(dot)(dot)(dot)(dot)(dot)(dot)(dot)(dot)(at)p0(dot)(dot)(dot)(dot)
00704010 db 00 00 00 00 00 00 00 78 78 78 78 78 78 78 78 ........xxxxxxxx
00704020 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
007040f0 09 00 00 00 00 00 00 00 33 02 00 00 1b 0d 00 00 ........3.......

hexdump -C .../primary_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56 00 00 00 00 00 00 00 00 (dot)(dot)(dot)(at)(dot)(dot)(dot)V(dot)(dot)(dot)(dot)(dot)(dot)(dot)(dot)
00703ff0 33 02 00 00 54 2a 00 00 e0 3e 70 b1 00 00 00 00 3...T*...>p.....
00704000 10 d1 05 00 01 00 00 00 00 40 70 b1 00 00 00 00 (dot)(dot)(dot)(dot)(dot)(dot)(dot)(dot)(dot)(at)p(dot)(dot)(dot)(dot)(dot)
00704010 23 02 00 00 00 00 00 00 80 0a 00 00 55 0f 07 2c #...........U..,
00704020 00 60 02 02 7f 06 00 00 05 00 00 00 f7 85 01 00 .`..............
00704030 00 00 00 00 ff 03 02 00 02 08 18 00 01 00 00 00 ................
00704040 e0 07 00 00 78 78 78 78 78 78 78 78 78 78 78 78 ....xxxxxxxxxxxx
00704050 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
00704230 78 78 78 78 78 78 78 78 01 00 00 00 00 00 00 00 xxxxxxxx........

(In this case, rows with 500 'x' were inserted when the primary was
stopped.)

`git bisect` for this behavior blames 3f1ce9734 (where
XLogDecodeNextRecord() -> XLogReadRecordAlloc() call was introduced).

A reproducer for the anomaly to follow.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alvaro Herrera 2023-05-11 09:14:37 Re: BUG #17926: Segfault in SELECT
Previous Message Daniel Gustafsson 2023-05-10 13:09:55 Re: BUG #17927: Postgres does not compile with LLVM-16