Re[2]: BUG #16894: PANIC: WAL contains references to invalid pages

From: Владимир Владимир <skilyazhnev(at)mail(dot)ru>
To: david(at)pgmasters(dot)net, michael(at)paquier(dot)xyz
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, antkurochkin(at)gmail(dot)com
Subject: Re[2]: BUG #16894: PANIC: WAL contains references to invalid pages
Date: 2021-03-06 07:32:33
Message-ID: 1615015953.849822391@f335.i.mail.ru
Lists: pgsql-bugs


 
>
>05.03.2021 15:55, David Steele:
>> On 3/4/21 11:32 PM, Michael Paquier wrote:
>>> On Wed, Mar 03, 2021 at 06:49:02AM -0500, David Steele wrote:
>>>> OK, but shouldn't we have a full page write for this page after the
>>>> backup starts, rather than the partial we seem to be seeing here? Is
>>>> there any condition where the full page write would be skipped
>>>> legitimately, or does it point to a problem?
>>>
>>> That's how things work.  If they don't work this way for physical
>>> backups, we may have a problem.  At replay, the full page will be
>>> replayed if BKPIMAGE_APPLY is correctly set, as per
>>> XLogReadBufferForRedoExtended().  And XLogRecordAssemble()
>>> does the decision when building the record (just grep for
>>> needs_backup).
>>
>> That's exactly the problem. This WAL record appears to be the only
>> reference to the page the submitter could find in their WAL. Anton,
>> could you confirm that?
>>
>> If we can confirm that there is no FPI for this page after the initial
>> backup checkpoint, wouldn't that point to an issue?
>>
>>>> If Postgres is running correctly there is certainly no expectation for
>>>> support of this unusual use case, but I do think that this possibly
>>>> points to an issue in Postgres, which under normal circumstances would
>>>> be very hard to detect.
>>>
>>> Well, the report tells that this is an issue that happens on those
>>> fake files full of zeros, but are you sure that you have the sizing
>>> right?  I still don't see any evidence of anything broken based on the
>>> information gathered for full backups, FWIW.
>>
>> This is the size we validate for normal restores so pretty sure it is
>> accurate. Actually, postgres doesn't seem to care if the file is large
>> enough as long as it is present. The reason we precreate the files at
>> the correct size is because when postgres extends the file it is not
>> sparse and uses more space.
>>
>> Regards,
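
As I understand it, the needs_backup decision Michael points at compares the page LSN with the redo pointer of the latest checkpoint. A simplified paraphrase of the logic in XLogRecordAssemble() (src/backend/access/transam/xloginsert.c), not the verbatim source:

    /* Does this block need a full-page image in the record being built? */
    if (regbuf->flags & REGBUF_FORCE_IMAGE)
        needs_backup = true;        /* caller forces an image */
    else if (regbuf->flags & REGBUF_NO_IMAGE)
        needs_backup = false;       /* caller forbids an image */
    else if (!doPageWrites)
        needs_backup = false;       /* no full_page_writes, no backup in progress */
    else
    {
        /*
         * Back the page up only if its last modification is at or before
         * the redo pointer of the latest checkpoint, i.e. no image has
         * been written for it since that checkpoint started.
         */
        XLogRecPtr  page_lsn = PageGetLSN(regbuf->page);

        needs_backup = (page_lsn <= RedoRecPtr);
    }

So the first modification of a page after each checkpoint gets an FPI, and later modifications within the same checkpoint cycle do not.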
 
I have reproduced the situation with the WAL files and the zeroed pgbackrest files.
The same problem shows up at the first LSN in the WAL.
I attach the logs from the recovery process:
 
    2021-03-06 00:36:13.784 MSK [95] LOG:  database system was interrupted; last known up at 2021-03-01 01:07:51 MSK
    2021-03-06 00:36:13.871 MSK [95] LOG:  starting point-in-time recovery to 2021-03-01 12:00:00+03
    2021-03-06 00:36:13.915 MSK [95] LOG:  restored log file "0000000B.history" from archive
    2021-03-06 00:36:14.158 MSK [95] LOG:  restored log file "0000000B000002710000008A" from archive
    2021-03-06 00:36:14.172 MSK [95] LOG:  redo starts at 271/8A000028
    2021-03-06 00:36:14.325 MSK [95] WARNING:  page 27 of relation base/17946/28008 is uninitialized
    2021-03-06 00:36:14.325 MSK [95] WARNING:  page 15 of relation base/17946/2840 is uninitialized
    2021-03-06 00:36:14.325 MSK [95] PANIC:  WAL contains references to invalid pages
    2021-03-06 00:36:14.681 MSK [92] LOG:  startup process (PID 95) was terminated by signal 6: Aborted
    2021-03-06 00:36:14.681 MSK [92] LOG:  terminating any other active server processes
    2021-03-06 00:36:14.708 MSK [92] LOG:  database system is shut down
    2021-03-06 00:36:16.175 MSK [9] LOG:  database system was interrupted while in recovery at log time 2021-03-01 01:00:11 MSK
    2021-03-06 00:36:16.175 MSK [9] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
    2021-03-06 00:36:16.263 MSK [9] LOG:  starting point-in-time recovery to 2021-03-01 12:00:00+03
    2021-03-06 00:36:16.306 MSK [9] LOG:  restored log file "0000000B.history" from archive
 
And if I look into 0000000B000002710000008A, I see:
 
    bash-4.2$ /usr/pgsql-11/bin/pg_waldump /var/lib/pgsql/11/data/pg_wal/0000000B000002710000008A -e 271/8A009E10
 
    rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 271/8A000028, prev 271/89A60D10, desc: RUNNING_XACTS nextXid 361855004 latestCompletedXid 361855003 oldestRunningXid 361855004
    rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 271/8A000060, prev 271/8A000028, desc: CHECKPOINT_ONLINE redo 271/8A000028; tli 11; prev tli 11; fpw true; xid 0:361855004; oid 29809888; multi 240501; offset 500196; oldest xid 162916086 in DB 17950; oldest multi 1 in DB 17948; oldest/newest commit timestamp xid: 0/0; oldest running xid 361855004; online
    rmgr: Heap        len (rec/tot):     65/  2685, tx:  361855004, lsn: 271/8A0000D0, prev 271/8A000060, desc: HOT_UPDATE off 4 xmax 361855004 ; new off 10 xmax 0, blkref #0: rel 1663/17948/24604 blk 4 FPW
    rmgr: Transaction len (rec/tot):     34/    34, tx:  361855004, lsn: 271/8A000B50, prev 271/8A0000D0, desc: COMMIT 2021-03-01 01:00:12.752834 MSK
    rmgr: Heap        len (rec/tot):     65/  1941, tx:  361855005, lsn: 271/8A000B78, prev 271/8A000B50, desc: HOT_UPDATE off 3 xmax 361855005 ; new off 5 xmax 0, blkref #0: rel 1663/17948/24604 blk 21 FPW
    rmgr: Transaction len (rec/tot):     34/    34, tx:  361855005, lsn: 271/8A001310, prev 271/8A000B78, desc: COMMIT 2021-03-01 01:00:13.772687 MSK
    rmgr: Heap        len (rec/tot):     65/  3745, tx:  361855006, lsn: 271/8A001338, prev 271/8A001310, desc: HOT_UPDATE off 2 xmax 361855006 ; new off 9 xmax 0, blkref #0: rel 1663/17948/24604 blk 1 FPW
    rmgr: Transaction len (rec/tot):     34/    34, tx:  361855006, lsn: 271/8A0021F8, prev 271/8A001338, desc: COMMIT 2021-03-01 01:00:14.865075 MSK
    rmgr: Heap        len (rec/tot):     73/    73, tx:  361855007, lsn: 271/8A002220, prev 271/8A0021F8, desc: HOT_UPDATE off 5 xmax 361855007 ; new off 11 xmax 0, blkref #0: rel 1663/17948/24604 blk 4
    rmgr: Transaction len (rec/tot):     34/    34, tx:  361855007, lsn: 271/8A002270, prev 271/8A002220, desc: COMMIT 2021-03-01 01:00:15.887661 MSK
    rmgr: Heap        len (rec/tot):     59/  1855, tx:  361855008, lsn: 271/8A002298, prev 271/8A002270, desc: LOCK off 27: xid 361855008: flags 0 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/17946/27953 blk 0 FPW
    rmgr: Heap        len (rec/tot):     69/    69, tx:  361855008, lsn: 271/8A0029D8, prev 271/8A002298, desc: HOT_UPDATE off 27 xmax 361855008 KEYS_UPDATED ; new off 28 xmax 361855008, blkref #0: rel 1663/17946/27953 blk 0
    rmgr: Heap        len (rec/tot):     54/  8134, tx:  361855008, lsn: 271/8A002A20, prev 271/8A0029D8, desc: INSERT off 115, blkref #0: rel 1663/17946/27924 blk 2602 FPW
    rmgr: Btree       len (rec/tot):     53/  3353, tx:  361855008, lsn: 271/8A004A00, prev 271/8A002A20, desc: INSERT_LEAF off 163, blkref #0: rel 1663/17946/27930 blk 11930 FPW
    rmgr: Heap        len (rec/tot):     54/  5906, tx:  361855008, lsn: 271/8A005720, prev 271/8A004A00, desc: INSERT off 102, blkref #0: rel 1663/17946/28008 blk 8 FPW
    rmgr: Btree       len (rec/tot):     53/  4193, tx:  361855008, lsn: 271/8A006E50, prev 271/8A005720, desc: INSERT_LEAF off 205, blkref #0: rel 1663/17946/28011 blk 335 FPW
    rmgr: Btree       len (rec/tot):     53/  6873, tx:  361855008, lsn: 271/8A007EB8, prev 271/8A006E50, desc: INSERT_LEAF off 4, blkref #0: rel 1663/17946/28013 blk 2 FPW
    rmgr: Heap        len (rec/tot):     59/  1115, tx:  361855008, lsn: 271/8A0099B0, prev 271/8A007EB8, desc: LOCK off 9: xid 361855008: flags 0 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/17946/27484 blk 2 FPW
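
The PANIC complains about page 27 of base/17946/28008 and page 15 of base/17946/2840. To list every record in this segment that touches those two relations, and to see whether any of them carries a full-page image (the FPW marker on the blkref), the dump can be filtered like this, for example:

    bash-4.2$ /usr/pgsql-11/bin/pg_waldump /var/lib/pgsql/11/data/pg_wal/0000000B000002710000008A | grep -E '1663/17946/(28008|2840) blk'
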
For a complete picture, I attach information about the cluster:
 
    bash-4.2$ /usr/pgsql-11/bin/pg_controldata /var/lib/pgsql/11/data/
 
    pg_control version number:            1100
    Catalog version number:               201809051
    Database system identifier:           6725332219754936319
    Database cluster state:               in archive recovery
    pg_control last modified:             Sat Mar  6 00:36:16 2021
    Latest checkpoint location:           271/8A000060
    Latest checkpoint's REDO location:    271/8A000028
    Latest checkpoint's REDO WAL file:    0000000B000002710000008A
    Latest checkpoint's TimeLineID:       11
    Latest checkpoint's PrevTimeLineID:   11
    Latest checkpoint's full_page_writes: on
    Latest checkpoint's NextXID:          0:361855004
    Latest checkpoint's NextOID:          29809888
    Latest checkpoint's NextMultiXactId:  240501
    Latest checkpoint's NextMultiOffset:  500196
    Latest checkpoint's oldestXID:        162916086
    Latest checkpoint's oldestXID's DB:   17950
    Latest checkpoint's oldestActiveXID:  361855004
    Latest checkpoint's oldestMultiXid:   1
    Latest checkpoint's oldestMulti's DB: 17948
    Latest checkpoint's oldestCommitTsXid:0
    Latest checkpoint's newestCommitTsXid:0
    Time of latest checkpoint:            Mon Mar  1 01:00:11 2021
    Fake LSN counter for unlogged rels:   0/1
    Minimum recovery ending location:     271/8AFB4028
    Min recovery ending loc's timeline:   11
    Backup start location:                0/0
    Backup end location:                  0/0
    End-of-backup record required:        no
    wal_level setting:                    replica
    wal_log_hints setting:                off
    max_connections setting:              1000
    max_worker_processes setting:         10
    max_prepared_xacts setting:           0
    max_locks_per_xact setting:           64
    track_commit_timestamp setting:       off
    Maximum data alignment:               8
    Database block size:                  8192
    Blocks per segment of large relation: 131072
    WAL block size:                       8192
    Bytes per WAL segment:                16777216
    Maximum length of identifiers:        64
    Maximum columns in an index:          32
    Maximum size of a TOAST chunk:        1996
    Size of a large-object chunk:         2048
    Date/time type storage:               64-bit integers
    Float4 argument passing:              by value
    Float8 argument passing:              by value
    Data page checksum version:           0
    Mock authentication nonce:            66bd29a825f4f3e8aea908b50f798ba53215e9f074129724e5a0936a51442e4f
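
If anyone wants to reproduce the zeroed-file setup, truncate can create a sparse, zero-filled file of a given size, which matches what David describes about precreating the files at the correct size without using extra disk space. The size itself has to come from the backup manifest; the value below is only a placeholder:

    bash-4.2$ truncate -s <original-size> /var/lib/pgsql/11/data/base/17946/28008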
 
--
Vladimir skilyazhnev
 
