Re: archive status ".ready" files may be created too early

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: bossartn(at)amazon(dot)com, a(dot)lubennikova(at)postgrespro(dot)ru, hlinnaka(at)iki(dot)fi, matsumura(dot)ryo(at)fujitsu(dot)com, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: archive status ".ready" files may be created too early
Date: 2021-01-02 16:54:39
Message-ID: EE2EA7B8-36E5-4C46-BD39-099713C4BC50@yandex-team.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

I was looking to review something in CF. This seems like a thread of some interest to me.

Recently we had somewhat related incident. Do I understand correctly that this incident is related to the bug discussed in this thread?

Primary instance was killed by OOM
[ 2020-11-12 15:27:03.732 MSK ,,,739,00000 ]:LOG: server process (PID 40189) was terminated by signal 9: Killed
after recovery it archived some WAL segments.
[ 2020-11-12 15:27:31.477 MSK ,,,739,00000 ]:LOG: database system is ready to accept connections
INFO: 2020/11/12 15:27:32.059541 FILE PATH: 0000000E0001C02F000000AF.br
INFO: 2020/11/12 15:27:32.114319 FILE PATH: 0000000E0001C02F000000B3.br

then PITR failed on another host
[ 2020-11-12 16:26:33.024 MSK ,,,51414,00000 ]:LOG: restored log file "0000000E0001C02F000000B3" from archive
[ 2020-11-12 16:26:33.042 MSK ,,,51414,00000 ]:LOG: invalid record length at 1C02F/B3FFF778: wanted 24, got 0
[ 2020-11-12 16:26:33.042 MSK ,,,51414,00000 ]:LOG: invalid record length at 1C02F/B3FFF778: wanted 24, got 0

archived segment has some zeroes at the end
rmgr: XLOG len (rec/tot): 51/ 1634, tx: 0, lsn: 1C02F/B3FFF058, prev 1C02F/B3FFEFE8, desc: FPI_FOR_HINT , blkref #0: rel 1663/14030/16384 blk 140 FPW
rmgr: Heap len (rec/tot): 129/ 129, tx: 3890578935, lsn: 1C02F/B3FFF6C0, prev 1C02F/B3FFF058, desc: HOT_UPDATE off 34 xmax 3890578935 ; new off 35 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot): 46/ 46, tx: 3890578935, lsn: 1C02F/B3FFF748, prev 1C02F/B3FFF6C0, desc: COMMIT 2020-11-12 15:27:31.507363 MSK
pg_waldump: FATAL: error in WAL record at 1C02F/**B3FFF748**: invalid record length at 1C02F/**B3FFF778**: wanted 24, got 0

Meanwhile next segment points to previous record at **B3FFF748**
postgres(at)man-odszl7u4361o8m3z:/tmp$ pg_waldump 0000000E0001C02F000000B4| head
rmgr: Heap len (rec/tot): 129/ 129, tx: 3890578936, lsn: 1C02F/B4000A68, prev 1C02F/**B3FFF778**, desc: HOT_UPDATE off 35 xmax 3890578936 ; new off 36 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot): 46/ 46, tx: 3890578936, lsn: 1C02F/B4000AF0, prev 1C02F/B4000A68, desc: COMMIT 2020-11-12 15:27:32.509443 MSK

Best regards, Andrey Borodin.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2021-01-02 18:44:20 Re: Reloptions for table access methods
Previous Message Michail Nikolaev 2021-01-02 16:00:04 Why latestRemovedXid|cuteoff_xid are always sent?