Re: pg ignores wal files in pg_wal, and instead tries to load them from archive/primary

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: pg ignores wal files in pg_wal, and instead tries to load them from archive/primary
Date: 2022-09-30 04:50:38
Message-ID: YzZ1ngq++Tazursx@depesz.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Sep 30, 2022 at 09:49:28AM +0900, Michael Paquier wrote:
> On Thu, Sep 29, 2022 at 05:51:02PM +0200, hubert depesz lubaczewski wrote:
> > What am I missing? what is wrong? How can I fix it? The problem is not fixing
> > *this server*, because we are in process of upgrading LOTS and LOTS of servers,
> > and I need to know what is broken/how to work around it.
> So, your backup_label file points to 00000001000007E4000000A0 as being
> the first segment to begin recovery from. This means that the startup

Well, *now* I don't have backup_label:

root(at)c273-pg2258:/var/lib/postgresql/14/main# ls -l
total 68520
-rw------- 1 postgres postgres 3 Sep 28 01:38 PG_VERSION
-rw------- 1 postgres postgres 61264030 Sep 30 04:42 autoprewarm.blocks
-rw------- 1 postgres postgres 231 Sep 28 01:38 backup_label.old
-rw------- 1 postgres postgres 8783060 Sep 28 01:38 backup_manifest
drwx------ 6 postgres postgres 74 Sep 28 01:38 base
-rw------- 1 postgres postgres 61 Sep 30 04:00 current_logfiles
drwx------ 2 postgres postgres 4096 Sep 29 14:25 global
drwx------ 2 postgres postgres 6 Sep 28 01:38 pg_commit_ts
drwx------ 2 postgres postgres 6 Sep 28 01:38 pg_dynshmem
drwx------ 4 postgres postgres 84 Sep 28 03:38 pg_logical
drwx------ 4 postgres postgres 48 Sep 28 01:38 pg_multixact
drwx------ 2 postgres postgres 6 Sep 28 01:38 pg_notify
drwx------ 2 postgres postgres 6 Sep 28 01:38 pg_replslot
drwx------ 2 postgres postgres 6 Sep 28 01:38 pg_serial
drwx------ 2 postgres postgres 6 Sep 28 01:38 pg_snapshots
drwx------ 2 postgres postgres 6 Sep 29 14:25 pg_stat
drwx------ 2 postgres postgres 43 Sep 29 14:25 pg_stat_tmp
drwx------ 2 postgres postgres 122 Sep 29 14:25 pg_subtrans
drwx------ 2 postgres postgres 27 Sep 28 01:38 pg_tblspc
drwx------ 2 postgres postgres 6 Sep 28 01:38 pg_twophase
drwx------ 3 postgres postgres 204800 Sep 28 03:38 pg_wal
drwx------ 2 postgres postgres 4096 Sep 28 01:38 pg_xact
-rw------- 1 postgres postgres 118 Sep 28 01:38 postgresql.auto.conf
-rw------- 1 postgres postgres 130 Sep 29 14:25 postmaster.opts
-rw------- 1 postgres postgres 103 Sep 29 14:25 postmaster.pid
-rw-r--r-- 1 postgres postgres 0 Sep 28 01:38 standby.signal

backup_label.old is:
root(at)c273-pg2258:/var/lib/postgresql/14/main# cat backup_label.old
START WAL LOCATION: 7D5/3B068338 (file 00000001000007D50000003B)
CHECKPOINT LOCATION: 7D5/3B076F78
BACKUP METHOD: streamed
BACKUP FROM: primary
START TIME: 2022-09-27 21:06:00 UTC
LABEL: pg_basebackup base backup
START TIMELINE: 1

> process is able to correctly kick recovery using the contents of the
> local pg_wal/ from 00000001000007E4000000A0 to 00000001000007E800000066.
> Once 00000001000007E800000067 cannot be read, the startup process
> fails to read this segment and decides to switch the source to get WAL
> from as streaming. The order should be local pg_wal/, then archiving
> and finally streaming if all are configured, as far as I recall.
> What's the error you are seeing when reading 00000001000007E800000067?

I don't see any errors. Let me show you some data from log:
2022-09-30 04:44:43.833 UTC,"postgres","postgres",2716081,"[local]",6336743b.2971b1,3,"authentication",2022-09-30 04:44:43 UTC,4/14269,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=psql",,,,,,,,,"","client backend",,1147616880456321454
2022-09-30 04:44:43.917 UTC,,,2716082,,6336743b.2971b2,1,,2022-09-30 04:44:43 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-30 04:44:43.918 UTC,,,2716082,,6336743b.2971b2,2,,2022-09-30 04:44:43 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0
2022-09-30 04:44:48.922 UTC,,,2716187,,63367440.29721b,1,,2022-09-30 04:44:48 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-30 04:44:48.922 UTC,,,2716187,,63367440.29721b,2,,2022-09-30 04:44:48 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0
2022-09-30 04:44:51.867 UTC,,,2716207,"[local]",63367443.29722f,1,"",2022-09-30 04:44:51 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-09-30 04:44:51.868 UTC,"postgres","db",2716207,"[local]",63367443.29722f,2,"authentication",2022-09-30 04:44:51 UTC,4/14271,0,LOG,00000,"connection authenticated: identity=""postgres"" method=peer (/etc/postgresql/14/main/pg_hba.conf:11)",,,,,,,,,"","client backend",,-659920678270393094
2022-09-30 04:44:51.868 UTC,"postgres","db",2716207,"[local]",63367443.29722f,3,"authentication",2022-09-30 04:44:51 UTC,4/14271,0,LOG,00000,"connection authorized: user=postgres database=db application_name=psql",,,,,,,,,"","client backend",,-659920678270393094
2022-09-30 04:44:53.751 UTC,,,2716218,"127.0.0.1:39078",63367445.29723a,1,"",2022-09-30 04:44:53 UTC,,0,LOG,00000,"connection received: host=127.0.0.1 port=39078",,,,,,,,,"","not initialized",,0
2022-09-30 04:44:53.927 UTC,,,2716219,,63367445.29723b,1,,2022-09-30 04:44:53 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-30 04:44:53.927 UTC,,,2716219,,63367445.29723b,2,,2022-09-30 04:44:53 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0
2022-09-30 04:44:58.931 UTC,,,2716255,,6336744a.29725f,1,,2022-09-30 04:44:58 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-30 04:44:58.931 UTC,,,2716255,,6336744a.29725f,2,,2022-09-30 04:44:58 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0
2022-09-30 04:45:03.936 UTC,,,2716366,,6336744f.2972ce,1,,2022-09-30 04:45:03 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-30 04:45:03.936 UTC,,,2716366,,6336744f.2972ce,2,,2022-09-30 04:45:03 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0
2022-09-30 04:45:08.751 UTC,,,2716401,"127.0.0.1:45926",63367454.2972f1,1,"",2022-09-30 04:45:08 UTC,,0,LOG,00000,"connection received: host=127.0.0.1 port=45926",,,,,,,,,"","not initialized",,0
2022-09-30 04:45:08.940 UTC,,,2716402,,63367454.2972f2,1,,2022-09-30 04:45:08 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-30 04:45:08.940 UTC,,,2716402,,63367454.2972f2,2,,2022-09-30 04:45:08 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0
2022-09-30 04:45:12.960 UTC,,,2716425,"[local]",63367458.297309,1,"",2022-09-30 04:45:12 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,"","not initialized",,0
2022-09-30 04:45:12.961 UTC,"postgres","postgres",2716425,"[local]",63367458.297309,2,"authentication",2022-09-30 04:45:12 UTC,4/14277,0,LOG,00000,"connection authenticated: identity=""postgres"" method=peer (/etc/postgresql/14/main/pg_hba.conf:11)",,,,,,,,,"","client backend",,1147616880456321454
2022-09-30 04:45:12.961 UTC,"postgres","postgres",2716425,"[local]",63367458.297309,3,"authentication",2022-09-30 04:45:12 UTC,4/14277,0,LOG,00000,"connection authorized: user=postgres database=postgres application_name=psql",,,,,,,,,"","client backend",,1147616880456321454
2022-09-30 04:45:13.946 UTC,,,2716431,,63367459.29730f,1,,2022-09-30 04:45:13 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-30 04:45:13.946 UTC,,,2716431,,63367459.29730f,2,,2022-09-30 04:45:13 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0

> Is pg_waldump able to understand its internals? This could point out
> to an issue with xlogreader.c, for example.

So... it can, but no records:

postgres(at)c273-pg2258:~/14/main$ /usr/lib/postgresql/14/bin/pg_waldump 00000001000007E800000067
pg_waldump: fatal: could not find a valid record after 7E8/67000000
postgres(at)c273-pg2258:~/14/main$ /usr/lib/postgresql/14/bin/pg_waldump 00000001000007E800000068
pg_waldump: fatal: could not find a valid record after 7E8/68000000

but previous one works:
postgres(at)c273-pg2258:~/14/main$ /usr/lib/postgresql/14/bin/pg_waldump 00000001000007E800000066 | tail -n 20
rmgr: Transaction len (rec/tot): 65/ 65, tx: 140005495, lsn: 7E8/66A0B010, prev 7E8/66A0AFC8, desc: COMMIT 2022-09-28 01:38:20.782352 UTC; origin: node 1, lsn CE64/A5086920, at 2022-09-28 01:38:20.780890 UTC
rmgr: Heap len (rec/tot): 54/ 54, tx: 140005496, lsn: 7E8/66A0B058, prev 7E8/66A0B010, desc: LOCK off 55: xid 140005496: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 17963/16644/195315099 blk 83995
rmgr: Heap len (rec/tot): 165/ 165, tx: 140005496, lsn: 7E8/66A0B090, prev 7E8/66A0B058, desc: HOT_UPDATE off 55 xmax 140005496 flags 0x10 KEYS_UPDATED ; new off 74 xmax 140005496, blkref #0: rel 17963/16644/195315099 blk 83995
rmgr: Transaction len (rec/tot): 65/ 65, tx: 140005496, lsn: 7E8/66A0B138, prev 7E8/66A0B090, desc: COMMIT 2022-09-28 01:38:20.783036 UTC; origin: node 3, lsn CE64/A5087E00, at 2022-09-28 01:38:20.780992 UTC
rmgr: XLOG len (rec/tot): 51/ 980, tx: 0, lsn: 7E8/66A0B180, prev 7E8/66A0B138, desc: FPI_FOR_HINT , blkref #0: rel 17963/16644/195326137 blk 271085 FPW
rmgr: Heap len (rec/tot): 54/ 54, tx: 140005497, lsn: 7E8/66A0B558, prev 7E8/66A0B180, desc: LOCK off 13: xid 140005497: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 17963/16644/195326137 blk 271085
rmgr: Heap len (rec/tot): 261/ 261, tx: 140005497, lsn: 7E8/66A0B590, prev 7E8/66A0B558, desc: HOT_UPDATE off 13 xmax 140005497 flags 0x10 KEYS_UPDATED ; new off 15 xmax 140005497, blkref #0: rel 17963/16644/195326137 blk 271085
rmgr: Transaction len (rec/tot): 65/ 65, tx: 140005497, lsn: 7E8/66A0B698, prev 7E8/66A0B590, desc: COMMIT 2022-09-28 01:38:20.784762 UTC; origin: node 1, lsn CE64/A508C7F0, at 2022-09-28 01:38:20.782858 UTC
rmgr: Heap len (rec/tot): 54/ 54, tx: 140005498, lsn: 7E8/66A0B6E0, prev 7E8/66A0B698, desc: LOCK off 8: xid 140005498: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 17963/16644/195328502 blk 3038500
rmgr: Heap len (rec/tot): 54/ 54, tx: 140005498, lsn: 7E8/66A0B718, prev 7E8/66A0B6E0, desc: LOCK off 8: xid 140005498: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 17963/16644/195328502 blk 3038500
rmgr: Heap len (rec/tot): 1821/ 1821, tx: 140005498, lsn: 7E8/66A0B750, prev 7E8/66A0B718, desc: UPDATE off 8 xmax 140005498 flags 0x10 KEYS_UPDATED ; new off 5 xmax 140005498, blkref #0: rel 17963/16644/195328502 blk 3038837, blkref #1: rel 17963/16644/195328502 blk 3038500
rmgr: Btree len (rec/tot): 64/ 64, tx: 140005498, lsn: 7E8/66A0BE70, prev 7E8/66A0B750, desc: INSERT_LEAF off 158, blkref #0: rel 17963/16644/195349680 blk 78396
rmgr: Btree len (rec/tot): 72/ 72, tx: 140005498, lsn: 7E8/66A0BEB0, prev 7E8/66A0BE70, desc: INSERT_LEAF off 72, blkref #0: rel 17963/16644/419942198 blk 82241
rmgr: Btree len (rec/tot): 64/ 64, tx: 140005498, lsn: 7E8/66A0BEF8, prev 7E8/66A0BEB0, desc: INSERT_LEAF off 268, blkref #0: rel 17963/16644/419946394 blk 18486
rmgr: Btree len (rec/tot): 64/ 64, tx: 140005498, lsn: 7E8/66A0BF38, prev 7E8/66A0BEF8, desc: INSERT_LEAF off 69, blkref #0: rel 17963/16644/419946856 blk 56634
rmgr: Btree len (rec/tot): 64/ 64, tx: 140005498, lsn: 7E8/66A0BF78, prev 7E8/66A0BF38, desc: INSERT_LEAF off 98, blkref #0: rel 17963/16644/419947709 blk 19044
rmgr: Btree len (rec/tot): 64/ 64, tx: 140005498, lsn: 7E8/66A0BFB8, prev 7E8/66A0BF78, desc: INSERT_LEAF off 53, blkref #0: rel 17963/16644/419948176 blk 10715
rmgr: Transaction len (rec/tot): 65/ 65, tx: 140005498, lsn: 7E8/66A0BFF8, prev 7E8/66A0BFB8, desc: COMMIT 2022-09-28 01:38:20.785822 UTC; origin: node 1, lsn CE64/A5091128, at 2022-09-28 01:38:20.784288 UTC
rmgr: XLOG len (rec/tot): 34/ 34, tx: 0, lsn: 7E8/66A0C058, prev 7E8/66A0BFF8, desc: BACKUP_END 7D5/3B068338
rmgr: XLOG len (rec/tot): 24/ 24, tx: 0, lsn: 7E8/66A0C080, prev 7E8/66A0C058, desc: SWITCH

> Another thing that has changed in this area is related to continuation
> records. What are the WAL contents at the end of
> 00000001000007E800000066? Do you spot an OVERWRITE_CONTRECORD when
> using pg_waldump on it?

No. There is no such thing in pg_waldump on *66 file.

If I understand correctly - this would suggest that the .66 file is the end of
wal taken from pg_basebackup. What are the 67 and next files then?

Best regards,

depesz

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Frank van Vugt 2022-09-30 06:33:17 v15rc1 shadowbuild fails when using src from symlink
Previous Message Yugo NAGATA 2022-09-30 01:23:42 Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands