pg ignores wal files in pg_wal, and instead tries to load them from archive/primary

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: pg ignores wal files in pg_wal, and instead tries to load them from archive/primary
Date: 2022-09-29 15:51:02
Message-ID: YzW+5v/VwbguW+XU@depesz.com
Lists: pgsql-bugs

Hi,
we have the following situation:

1. primary on 14.5 that is *not* archiving (this is a temporary situation
related to the ongoing process of upgrading from pg 12) - all on ubuntu focal.
2. on new replica we run (via wrapper, but this doesn't seem to be
related):
pg_basebackup -D /var/lib/postgresql/14/main -c fast -v -P -U some-user -h sourcedb.hostname
3. after it is done, if the datadir was large enough, pg on the replica
doesn't replicate/catch up, because, from the logs:
2022-09-29 14:59:26.587 UTC,,,2355588,,6335b2ce.23f184,1,,2022-09-29 14:59:26 UTC,,0,LOG,00000,"started streaming WAL from primary at 7E8/67000000 on timeline 1",,,,,,,,,"","walreceiver",,0
2022-09-29 14:59:26.587 UTC,,,2355588,,6335b2ce.23f184,2,,2022-09-29 14:59:26 UTC,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested WAL segment 00000001000007E800000067 has already been removed",,,,,,,,,"","walreceiver",,0
4. if there is a restore_command configured, it tries to read data from the archive
too, but the archive is nonexistent.
5. the "missing" file is there, in pg_wal (I would assume that
pg_basebackup copied it there):
root(at)host# /bin/ls -c1 0* | wc -l
1068
root(at)host# /bin/ls -c1 0* | sort -V | head -n 1
00000001000007E4000000A0
root(at)host# /bin/ls -c1 0* | sort -V | tail -n 1
00000001000007E800000092
root(at)host# /bin/ls -c1 0* | sort -V | grep -n 00000001000007E800000067
1043:00000001000007E800000067
root(at)host# /bin/ls -c1 0* | sort -V | grep -n -C5 00000001000007E800000067
1038-00000001000007E800000062
1039-00000001000007E800000063
1040-00000001000007E800000064
1041-00000001000007E800000065
1042-00000001000007E800000066
1043:00000001000007E800000067
1044-00000001000007E800000068
1045-00000001000007E800000069
1046-00000001000007E800000070
1047-00000001000007E800000071
1048-00000001000007E800000072
6. What's more - I straced the startup process, and it:
a. opens the wal file (the problematic one)
b. reads 8k from it
c. closes it
d. checks existence of finish.recovery trigger file (it doesn't exist)
e. starts restore program (which fails).
f. rinse and repeat
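
As a cross-check, the segment name in the FATAL message can be derived from the LSN in the "started streaming WAL" line. The sketch below assumes the default 16MB wal_segment_size; the function name wal_segment_name is hypothetical and not part of any PostgreSQL tool.

```shell
# Map a timeline ID and an LSN (as printed in the logs) to a WAL segment
# file name. Assumption: default 16MB segments, so the name is
# <timeline><high 32 bits of LSN><low 32 bits / segment size>, each 8 hex digits.
wal_segment_name() (
    tli=$1
    lsn=$2
    seg_size=$((16 * 1024 * 1024))
    hi=$((0x${lsn%%/*}))    # part before the slash
    lo=$((0x${lsn##*/}))    # part after the slash
    printf '%08X%08X%08X\n' "$tli" "$hi" "$((lo / seg_size))"
)

wal_segment_name 1 7E8/67000000
```

For timeline 1 and LSN 7E8/67000000 this gives 00000001000007E800000067, the same file that is present in pg_wal.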

What am I missing? What is wrong? How can I fix it? The problem is not fixing
*this server*, because we are in the process of upgrading LOTS and LOTS of servers,
and I need to know what is broken/how to work around it.

Currently our go-to fix is:
1. increase wal_keep_size to ~ 200GB
2. stand up the replica
3. once it catches up, decrease wal_keep_size to the standard (for us) 16GB

but it is not really a nice solution.
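
A minimal sketch of steps 1 and 3 of that workaround, assuming they are applied on the primary with ALTER SYSTEM and a configuration reload. The helper name wal_keep_size_sql and the psql connection options are illustrative assumptions, not taken from our actual tooling.

```shell
# Print the SQL that changes wal_keep_size and reloads the configuration.
# wal_keep_size_sql is a hypothetical helper; the value is passed as-is.
wal_keep_size_sql() {
    printf "ALTER SYSTEM SET wal_keep_size = '%s';\n" "$1"
    printf "SELECT pg_reload_conf();\n"
}

# Assumed usage on the primary (connection options are placeholders):
#   wal_keep_size_sql 200GB | psql -X -U postgres   # before standing up the replica
#   wal_keep_size_sql 16GB  | psql -X -U postgres   # after the replica has caught up
wal_keep_size_sql 200GB
```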

Best regards,

depesz
