Physical slot restart_lsn advances incorrectly requiring restore from archive

From: "Hsu, John" <hsuchen(at)amazon(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Physical slot restart_lsn advances incorrectly requiring restore from archive
Date: 2020-07-10 20:44:30
Message-ID: 77734732-44A4-4209-8C2F-3AF36C9D4D18@amazon.com
Lists: pgsql-hackers

Hi hackers,

We believe we're seeing a problem with how a physical slot's restart_lsn is advanced, which can leave replicas needing to restore from archive before replication can resume.
The logs below are from reproductions against 10.13; I'm still working on reproducing it on 12.3.

The sequence of events is:

1. A WAL write spans two WAL segments.
2. The write to the first WAL segment completes, but the write to the second segment does not.
3. The replica acknowledges the first WAL segment as flushed.
4. The primary restarts before the write to the second segment completes, which also means the complete WAL record was never written.
5. Crash recovery finishes at a record before the incomplete WAL write.
6. The replica nevertheless asks the slot to start at the next WAL segment, since the previous segment was already acknowledged as flushed.
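
For context, the walsender today simply adopts whatever flush position the standby reports, without checking it against what the primary has durably written. Below is a simplified paraphrase of PhysicalConfirmReceivedLocation() in src/backend/replication/walsender.c (PG 10); it is a sketch of the relevant logic, not the exact source:

#include "postgres.h"

#include "replication/slot.h"

/*
 * Simplified paraphrase of PhysicalConfirmReceivedLocation() (PostgreSQL 10).
 * The flush LSN reported by the standby is taken at face value and becomes
 * the slot's new restart_lsn.
 */
static void
PhysicalConfirmReceivedLocation_paraphrase(XLogRecPtr lsn)
{
	ReplicationSlot *slot = MyReplicationSlot;
	bool		changed = false;

	Assert(lsn != InvalidXLogRecPtr);

	SpinLockAcquire(&slot->mutex);
	if (slot->data.restart_lsn != lsn)
	{
		slot->data.restart_lsn = lsn;	/* trusted exactly as reported */
		changed = true;
	}
	SpinLockRelease(&slot->mutex);

	if (changed)
	{
		/*
		 * Recompute the oldest WAL any slot still needs; everything older
		 * than the new minimum becomes removable at the next checkpoint.
		 */
		ReplicationSlotMarkDirty();
		ReplicationSlotsComputeRequiredLSN();
	}
}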

Primary crashes because it ran out of space:
2020-07-10 01:23:31.399 UTC:10.15.4.83(56430):replication_user(at)[unknown]:[4554]:DEBUG: write 0/2C000000 flush 0/2BFD2000 apply 0/216ACCD0
2020-07-10 01:23:31.401 UTC:10.15.4.83(56430):replication_user(at)[unknown]:[4554]:DEBUG: write 0/2C000000 flush 0/2C000000 apply 0/216AE728
2020-07-10 01:23:31.504 UTC::@:[4548]:DEBUG: creating and filling new WAL file
2020-07-10 01:23:31.511 UTC::@:[4548]:PANIC: could not write to file "pg_wal/xlogtemp.4548": No space left on device
2020-07-10 01:23:31.518 UTC::@:[4543]:DEBUG: reaping dead processes

Crash recovery begins:
2020-07-10 01:23:36.074 UTC::@:[8677]:DEBUG: checkpoint record is at 0/2B51B030
2020-07-10 01:23:36.074 UTC::@:[8677]:DEBUG: redo record is at 0/2A65AE08; shutdown FALSE
..
2020-07-10 01:23:36.076 UTC::@:[8677]:DEBUG: starting up replication slots
2020-07-10 01:23:36.076 UTC::@:[8677]:DEBUG: restoring replication slot from "pg_replslot/physical_slot_1/state"
2020-07-10 01:23:36.078 UTC::@:[8677]:LOG: restart_lsn for cp slot physical_slot_1: 0/2BF12000 (extra debug logs I added)
2020-07-10 01:23:36.081 UTC::@:[8677]:LOG: redo starts at 0/2A65AE08
...
2020-07-10 01:23:36.325 UTC::@:[8677]:LOG: redo done at 0/2BFFFFB0
...
2020-07-10 01:23:36.330 UTC::@:[8677]:LOG: checkpoint starting: end-of-recovery immediate
2020-07-10 01:23:36.332 UTC::@:[8677]:DEBUG: performing replication slot checkpoint
...
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=13 file=base/13934/2662 time=0.001 msec
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=14 file=base/13934/2663 time=0.001 msec
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=15 file=base/13934/24586 time=0.001 msec
2020-07-10 01:23:36.385 UTC::@:[8677]:LOG: could not signal for checkpoint: checkpointer is not running
2020-07-10 01:23:36.385 UTC::@:[8677]:DEBUG: creating and filling new WAL file
2020-07-10 01:23:36.397 UTC::@:[8677]:PANIC: could not write to file "pg_wal/xlogtemp.8677": No space left on device

The primary runs out of space during crash recovery. Space is freed up afterwards and crash recovery begins again.

2020-07-10 01:32:45.804 UTC::@:[16329]:DEBUG: checkpoint record is at 0/2B51B030
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: redo record is at 0/2A65AE08; shutdown FALSE
...
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: starting up replication slots
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: restoring replication slot from "pg_replslot/physical_slot_1/state"
2020-07-10 01:32:45.806 UTC::@:[16329]:LOG: restart_lsn for cp slot physical_slot_1: 0/2BF12000
...
2020-07-10 01:32:45.809 UTC::@:[16329]:LOG: redo starts at 0/2A65AE08
2020-07-10 01:32:46.043 UTC::@:[16329]:DEBUG: could not open file "pg_wal/00000001000000000000002C": No such file or directory
2020-07-10 01:32:46.043 UTC::@:[16329]:LOG: redo done at 0/2BFFFFB0

Redo finishes at 0/2BFFFFB0 even though the flush position we received from the replica was already 0/2C000000.

This is problematic because the replica reconnects to the slot telling it to start past the new redo point.

2020-07-10 01:32:50.641 UTC:10.15.4.83(56698):replication_user(at)[unknown]:[16572]:DEBUG: received replication command: START_REPLICATION SLOT "physical_slot_1" 0/2C000000 TIMELINE 1
2020-07-10 01:32:50.641 UTC:10.15.4.83(56698):replication_user(at)[unknown]:[16572]:DEBUG: "walreceiver" has now caught up with upstream server
2020-07-10 01:32:50.774 UTC:10.15.4.83(56698):replication_user(at)[unknown]:[16572]:DEBUG: write 0/2C000B80 flush 0/2BFFFFF0 apply 0/2BFFFFF0
2020-07-10 01:32:50.775 UTC:10.15.4.83(56698):replication_user(at)[unknown]:[16572]:DEBUG: write 0/2C000B80 flush 0/2C000B80 apply 0/2BFFFFF0

This leads to a mismatch between what the primary has at the end of 0/2B and what was streamed to the replica.
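
To make the segment math explicit: with the default 16MB wal_segment_size, the acknowledged flush of 0/2C000000 is exactly the first byte of segment 00000001000000000000002C, while redo ended at 0/2BFFFFB0 inside segment ...2B. A small standalone C snippet that mirrors the XLByteToSeg/XLogFileName arithmetic (the helper names here are made up for illustration):

#include <stdio.h>
#include <stdint.h>

/*
 * Standalone illustration of the LSN-to-segment arithmetic (mirrors the
 * XLByteToSeg/XLogFileName macros; assumes the default 16 MB segment size).
 */
#define WAL_SEG_SIZE      ((uint64_t) 16 * 1024 * 1024)
#define SEGS_PER_XLOGID   (UINT64_C(0x100000000) / WAL_SEG_SIZE)

static void
print_segment(const char *label, uint64_t lsn, unsigned timeline)
{
	uint64_t	segno = lsn / WAL_SEG_SIZE;

	printf("%s: LSN %X/%08X -> segment %08X%08X%08X\n",
		   label,
		   (unsigned) (lsn >> 32), (unsigned) lsn,
		   timeline,
		   (unsigned) (segno / SEGS_PER_XLOGID),
		   (unsigned) (segno % SEGS_PER_XLOGID));
}

int
main(void)
{
	/* Values taken from the logs above */
	print_segment("redo done        ", UINT64_C(0x2BFFFFB0), 1);
	print_segment("replica flush ack", UINT64_C(0x2C000000), 1);
	return 0;
}

This prints segment 00000001000000000000002B for the redo end and 00000001000000000000002C for the acknowledged flush, i.e. the slot is advanced on the strength of a segment the primary never completed.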

Replica logs:
2020-07-10 01:32:50.671 UTC::@:[24899]:LOG: started streaming WAL from primary at 0/2C000000 on timeline 1
...
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: invalid contrecord length 90 at 0/2BFFFFF0
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: switched WAL source from archive to stream after failure

Now the physical slot has advanced past 0/2B, which is what the replica actually needs.

postgres=> select * from pg_replication_slots;
    slot_name    | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 physical_slot_1 |        | physical  |        |          | f         | f      |            |      |              | 0/2C000B80  |
(1 row)


0/2B can now rotate out on the primary, and a restore from archive is required for replication to resume.
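
For reference, checkpoints only retain WAL back to the segment containing the minimum restart_lsn across slots, which is why a restart_lsn of 0/2C000B80 no longer protects segment ...2B. A condensed paraphrase of that retention check (based on KeepLogSeg() in src/backend/access/transam/xlog.c, PG 10, with the wal_keep_segments branch omitted):

#include "postgres.h"

#include "access/xlog.h"
#include "access/xlog_internal.h"

/*
 * Condensed paraphrase of KeepLogSeg() (PostgreSQL 10).  Segments older
 * than *logSegNo may be recycled at checkpoint; slots only hold that back
 * to the segment containing the minimum restart_lsn across all slots.
 */
static void
KeepLogSeg_paraphrase(XLogRecPtr recptr, XLogSegNo *logSegNo)
{
	XLogSegNo	segno;
	XLogRecPtr	keep = XLogGetReplicationSlotMinimumLSN();

	XLByteToSeg(recptr, segno);

	if (keep != InvalidXLogRecPtr)
	{
		XLogSegNo	slotSegNo;

		XLByteToSeg(keep, slotSegNo);

		/*
		 * Once restart_lsn has jumped into segment ...2C, slotSegNo no
		 * longer holds back removal of segment ...2B.
		 */
		if (slotSegNo < segno)
			segno = slotSegNo;
	}

	if (segno < *logSegNo)
		*logSegNo = segno;
}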

The attached patch (against 10) attempts to address this by keeping track of the first flushLsn seen in the current segNo and waiting until we receive one past it before advancing the restart_lsn. This prevents the WAL from rotating out on the primary, so a reboot of the replica fixes replication instead of requiring a restore from archive.
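
In rough terms the gating condition looks like the sketch below; this is only an illustration of the idea as described above, not the attached patch itself, and the two tracking variables are hypothetical:

#include "postgres.h"

#include "access/xlog_internal.h"
#include "replication/slot.h"

/*
 * Illustrative sketch only -- the attached patch is the authoritative
 * change.  Track the first flush LSN reported within a WAL segment and
 * only advance restart_lsn once a later flush inside that same segment
 * arrives, so a lone flush at a segment boundary cannot move the slot
 * into a segment the primary may never have completed.
 */
static XLogSegNo	tracked_segno = 0;						/* hypothetical */
static XLogRecPtr	first_flush_in_segno = InvalidXLogRecPtr;	/* hypothetical */

static void
PhysicalConfirmReceivedLocation_sketch(XLogRecPtr lsn)
{
	ReplicationSlot *slot = MyReplicationSlot;
	XLogSegNo	flushSegNo;

	XLByteToSeg(lsn, flushSegNo);

	if (flushSegNo != tracked_segno)
	{
		/*
		 * First flush we have seen for this segment: remember it, but do
		 * not advance restart_lsn yet.
		 */
		tracked_segno = flushSegNo;
		first_flush_in_segno = lsn;
		return;
	}

	if (lsn <= first_flush_in_segno)
		return;					/* still waiting for a later flush */

	/* A later flush within the same segment: safe to advance as before. */
	SpinLockAcquire(&slot->mutex);
	slot->data.restart_lsn = lsn;
	SpinLockRelease(&slot->mutex);

	ReplicationSlotMarkDirty();
	ReplicationSlotsComputeRequiredLSN();
}

In practice the bookkeeping would live per slot rather than in walsender-local statics; the sketch only shows the wait-for-a-later-flush condition.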

With the patch:

Primary goes into crash recovery and we avoid updating the restart_lsn of the slot:

2020-07-10 18:50:12.686 UTC::@:[6160]:LOG: redo starts at 0/2D417108
2020-07-10 18:50:12.965 UTC::@:[6160]:DEBUG: could not open file "pg_wal/00000001000000000000002F": No such file or directory
2020-07-10 18:50:12.965 UTC::@:[6160]:LOG: redo done at 0/2EFFFF90
...
2020-07-10 18:59:32.987 UTC:10.15.0.240(9056):replication_user(at)[unknown]:[19623]:DEBUG: received replication command: START_REPLICATION SLOT "physical_slot_2" 0/2F000000 TIMELINE 1
2020-07-10 18:59:33.937 UTC:10.15.0.240(9056):replication_user(at)[unknown]:[19623]:DEBUG: write 0/2F020000 flush 0/2EFFFFD0 apply 0/2EFFFFD0
2020-07-10 18:59:33.937 UTC:10.15.0.240(9056):replication_user(at)[unknown]:[19623]:LOG: lsn is not in restartSegNo, update to match
2020-07-10 18:59:33.938 UTC:10.15.0.240(9056):replication_user(at)[unknown]:[19623]:DEBUG: write 0/2F020000 flush 0/2F020000 apply 0/2EFFFFD0
2020-07-10 18:59:33.938 UTC:10.15.0.240(9056):replication_user(at)[unknown]:[19623]:LOG: lsn is not in restartSegNo, update to match

Replica logs:
2020-07-10 18:59:54.040 UTC::@:[12873]:DEBUG: could not restore file "00000001000000000000002F" from archive: child process exited with exit code 1
2020-07-10 18:59:54.040 UTC::@:[12873]:DEBUG: invalid contrecord length 58 at 0/2EFFFFD0


Since the flush LSN hasn't advanced past the first one seen in the restartSegNo, the slot's restart_lsn doesn't get updated in subsequent checkpoints.

postgres=> select * from pg_replication_slots;
    slot_name    | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 physical_slot_2 |        | physical  |        |          | f         | f      |            |      |              | 0/2DFF8000  |
(1 row)


Rebooting the replica allows replication to resume from the slot and restart_lsn advances normally.

Thanks,

John H

Attachment Content-Type Size
0001-delay-slot-restart-lsn.patch application/octet-stream 1.8 KB
