Bug on update timing of walrcv->flushedUpto variable

From: 蔡梦娟(玊于) <mengjuan(dot)cmj(at)alibaba-inc(dot)com>
To: "pgsql-hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Bug on update timing of walrcv->flushedUpto variable
Date: 2021-03-26 15:44:21
Message-ID: 3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com
Lists: pgsql-hackers

Hi, all

Recently, I found a bug in the timing of updates to the walrcv->flushedUpto variable. Consider the following scenario: there is one Primary node and one Standby node that streams from the Primary.
A large number of SQL statements are running on the Primary, and an xlog record generated by them may be longer than the space left in the current page, so it has to be written across pages. As shown below, the length of last_xlog in wal_1 is greater than the space left in last_page, so the rest of it has to be written in wal_2. If the Primary crashes after flushing last_page of wal_1 to disk but before the remaining content of last_xlog has been flushed, then last_xlog in wal_1 is incomplete. In this case the Standby has also received wal_1 through WAL streaming.
[Attached image: 日志1.png ("log 1")]
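
To make the cross-page case concrete, here is a minimal sketch in C. It is not PostgreSQL source: the function names are hypothetical, the 8 kB page size and 16 MB segment size are assumed defaults, and page headers on continuation pages are ignored for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed defaults, not taken from the mail: 8 kB WAL pages, 16 MB segments. */
#define WAL_PAGE_SIZE 8192u
#define WAL_SEG_SIZE  (16u * 1024u * 1024u)

/* Bytes left in the WAL page that contains lsn (page headers ignored). */
static uint32_t
space_left_in_page(uint64_t lsn)
{
    return WAL_PAGE_SIZE - (uint32_t) (lsn % WAL_PAGE_SIZE);
}

/* True if a record of rec_len bytes starting at lsn does not fit in its page. */
static bool
record_crosses_page(uint64_t lsn, uint32_t rec_len)
{
    return rec_len > space_left_in_page(lsn);
}

/* True if the record also continues into the next segment file. */
static bool
record_crosses_segment(uint64_t lsn, uint32_t rec_len)
{
    uint64_t left_in_seg = WAL_SEG_SIZE - (lsn % WAL_SEG_SIZE);

    return rec_len > left_in_seg;
}
```

In the scenario above, last_xlog starts in the last page of wal_1 with less room left than its total length, so both checks are true and its tail belongs to the first page of wal_2.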

When the Primary restarts after the crash, crash recovery finds that last_xlog of wal_1 is invalid, and the Primary overwrites the space of last_xlog by inserting new xlog records. The Standby, however, does not do this, so the xlog on the Primary and the Standby is inconsistent from this point on.

When the Standby restarts and replays last_xlog, it first reads the XLogRecord structure (the header of last_xlog was completely flushed) and finds that it has to reassemble last_xlog. The next page of last_xlog is in wal_2, which does not exist in the Standby's pg_wal. So it requests xlog streaming from the Primary to get wal_2, and updates walrcv->flushedUpto once it has received the new xlog and flushed it to disk. The value of walrcv->flushedUpto is now some LSN within wal_2.

The Standby gets wal_2 from the Primary, but the first page of wal_2 does not contain the remaining content of last_xlog, which has already been overwritten by new xlog on the Primary. The Standby checks the record and finds it invalid, so it reads last_xlog again and calls the WaitForWALToBecomeAvailable function, which shuts down WAL streaming and reads the record from pg_wal.

The record read from pg_wal is invalid as well, so the Standby requests WAL streaming again. Note that walrcv->flushedUpto was already set to an LSN within wal_2 earlier, which is greater than the LSN the Standby needs, so the variable havedata in WaitForWALToBecomeAvailable is always true. The Standby therefore considers the xlog received and reads the content from wal_2.

Next comes the endless loop: the Standby finds the xlog invalid -> reads last_xlog again -> shuts down WAL streaming and reads xlog from pg_wal -> finds the xlog invalid -> requests WAL streaming, expecting to get the correct xlog, but returns from WaitForWALToBecomeAvailable immediately because walrcv->flushedUpto is always greater than the LSN it needs -> reads the xlog and finds it invalid -> reads last_xlog again -> ...
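
The availability check and the resulting loop can be modelled roughly as follows. This is a simplified sketch, not the code of WaitForWALToBecomeAvailable: the type Lsn and both function names are hypothetical, and the only thing modelled is the comparison of the requested LSN against a flushed position that is never reset between attempts.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for XLogRecPtr */

/*
 * Simplified model of the havedata decision: data is treated as available
 * as soon as the requested position lies below the flushed position.
 */
static bool
have_data(Lsn rec_ptr, Lsn flushed_upto)
{
    return rec_ptr < flushed_upto;
}

/*
 * Count how many of the given attempts would return immediately, without
 * waiting for new WAL, when flushed_upto keeps its old value across attempts.
 */
static int
count_immediate_returns(Lsn rec_ptr, Lsn flushed_upto, int attempts)
{
    int n = 0;

    for (int i = 0; i < attempts; i++)
    {
        /* flushed_upto is deliberately not changed between attempts */
        if (have_data(rec_ptr, flushed_upto))
            n++;
    }
    return n;
}
```

With rec_ptr pointing at last_xlog in wal_1 and flushed_upto left at an LSN in wal_2, every attempt returns immediately, which matches the loop described above.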

In this case, the Standby will never get the correct xlog record until it restarts.

The confusing point is: why is walrcv->flushedUpto updated only at the first startup of the walreceiver on a specific timeline, and not each time xlog streaming is requested? In the case above, it would also be reasonable to set walrcv->flushedUpto back to wal_1 when the Standby re-receives wal_1. So I changed the code to update walrcv->flushedUpto each time xlog streaming is requested. That is the patch I want to share with you, based on postgresql-13.2. What do you think of this change?
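
The difference between the two behaviours can be sketched as below. This is an illustrative model only, not the attached patch and not the PostgreSQL source: the struct, its fields, and both function names are hypothetical, and the condition used for "first startup on a specific timeline" is an assumption made for the sketch.

```c
#include <stdint.h>

typedef uint64_t Lsn;           /* hypothetical stand-in for XLogRecPtr */
typedef uint32_t TimeLineId;

/* Hypothetical, reduced model of the shared walreceiver state. */
typedef struct
{
    Lsn        receive_start;   /* 0 means streaming was never requested */
    TimeLineId received_tli;
    Lsn        flushed_upto;
} WalRcvState;

/* Described behaviour: reset flushed_upto only on first start on a timeline. */
static void
request_streaming_current(WalRcvState *s, Lsn recptr, TimeLineId tli)
{
    if (s->receive_start == 0 || s->received_tli != tli)
    {
        s->flushed_upto = recptr;
        s->received_tli = tli;
    }
    s->receive_start = recptr;
}

/* Proposed behaviour: reset flushed_upto on every streaming request. */
static void
request_streaming_proposed(WalRcvState *s, Lsn recptr, TimeLineId tli)
{
    s->flushed_upto = recptr;
    s->received_tli = tli;
    s->receive_start = recptr;
}
```

In this model, a second request on the same timeline that starts in wal_1 leaves flushed_upto at its old position in wal_2 under the first variant, while the second variant moves it back to the requested start, so a later comparison against the needed LSN no longer succeeds on stale data.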

By the way, I would also like to know why the pgstat_reset_all function is called during the recovery process.

Thanks & Best Regards

Attachment Content-Type Size
0001-Update-walrcv-flushedUpto-each-time-when-request-xlog-streaming.patch application/octet-stream 1.9 KB
