Re: Bug on update timing of walrcv->flushedUpto variable

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: mengjuan(dot)cmj(at)alibaba-inc(dot)com
Cc: bossartn(at)amazon(dot)com, x4mmm(at)yandex-team(dot)ru, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Bug on update timing of walrcv->flushedUpto variable
Date: 2021-03-29 01:54:41
Message-ID: 20210329.105441.1978082841561262877.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Hi.

(Added Nathan, Andrey and Heikki in Cc:)

At Fri, 26 Mar 2021 23:44:21 +0800, "蔡梦娟(玊于)" <mengjuan(dot)cmj(at)alibaba-inc(dot)com> wrote in
> Hi, all
>
> Recently, I found a bug in the update timing of the walrcv->flushedUpto variable. Consider the following scenario: there is one Primary node and one Standby node that is streaming from the Primary.
> A large number of SQL statements are running on the Primary, and an xlog record generated by these statements may be longer than the space left in the current page, so it has to be written across pages. As shown below, the last_xlog of wal_1 is longer than the space left in last_page, so it has to continue into wal_2. If the Primary crashes after flushing the last_page of wal_1 to disk, but before the remaining content of last_xlog has been flushed, then the last_xlog in wal_1 will be incomplete. In this case the Standby has also received wal_1 via WAL streaming.

This looks like the same issue as the one discussed in [1].

There are two symptoms of the issue. One is that the archive ends
with a segment that ends with an immature WAL record, which causes an
inconsistency between the archive and the pg_wal directory. The other
is, as you saw, that walreceiver receives an immature record at the
end of a segment, which prevents recovery from proceeding.

In that thread, we are trying to solve this by preventing such
immature records at a segment boundary from being archived and from
being sent to the standby.

> [日志1.png]

It doesn't seem to be attached.

> The confusing point is: why is walrcv->flushedUpto updated only at the first startup of walreceiver on a specific timeline, and not each time xlog streaming is requested? In the above case, it would also be reasonable to update walrcv->flushedUpto to wal_1 when the Standby re-receives wal_1. So I changed it to update walrcv->flushedUpto each time xlog streaming is requested, which is the patch I want to share with you, based on postgresql-13.2. What do you think of this change?
>
> By the way, I also want to know why the pgstat_reset_all function is called during the recovery process?

We shouldn't move flushedUpto backward. The variable tells how far
recovery (the startup process) can safely read WAL. Once the startup
process has read the beginning of a record, XLogReadRecord tries to
continue by fetching *only the rest* of the record, which in this
scenario is inconsistent with the first part. So this fix alone does
not work. We also need to fix the archive inconsistency, maybe as
part of a fix for this issue.
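
To illustrate the "never move backward" rule, here is a toy sketch. It
is not the walreceiver code: the XLogRecPtr typedef is a stand-in for
PostgreSQL's 64-bit WAL position, and advance_flushed_upto() is a
hypothetical helper made up for this example.

```c
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical helper, not a PostgreSQL function.
 *
 * Returns the new value of a "safe to read up to here" marker.  The
 * marker only ever advances: a candidate position behind the current
 * one is ignored, because a reader may already have consumed WAL up
 * to the current position.
 */
XLogRecPtr
advance_flushed_upto(XLogRecPtr current, XLogRecPtr candidate)
{
    return (candidate > current) ? candidate : current;
}
```

With this rule, re-receiving an older segment leaves the marker where
it was, so the startup process never sees the readable range shrink
under it.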

We are trying to fix this by refraining from archiving (or streaming)
until a record crossing a segment boundary is completely flushed.
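
A rough sketch of that idea, under simplifying assumptions: a record
is modelled as a byte range [rec_start, rec_end) and segments have a
fixed size. The function names are hypothetical and this is not the
patch from [1], only a model of the rule.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical helper: true if the record occupying [start, end)
 * spans more than one segment.  Assumes end > start and seg_size > 0.
 */
bool
record_crosses_segment(XLogRecPtr start, XLogRecPtr end, uint64_t seg_size)
{
    return (start / seg_size) != ((end - 1) / seg_size);
}

/*
 * Hypothetical helper: how far WAL may be archived or streamed.
 *
 * If the record is only partially flushed and crosses a segment
 * boundary, hold back to the start of the record, so that no consumer
 * ever sees just its first half.  Otherwise everything flushed so far
 * may be handed out.
 */
XLogRecPtr
safe_upto(XLogRecPtr flushed, XLogRecPtr rec_start, XLogRecPtr rec_end,
          uint64_t seg_size)
{
    bool partially_flushed = (flushed > rec_start && flushed < rec_end);

    if (partially_flushed &&
        record_crosses_segment(rec_start, rec_end, seg_size))
        return rec_start;

    return flushed;
}
```

For example, with a segment size of 1000 and a record at [900, 1050),
a flush position of 1000 (the end of the first segment) yields 900,
and only a flush position of 1050 or later releases the whole record.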

regards.

[1] https://www.postgresql.org/message-id/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com

--
Kyotaro Horiguchi
NTT Open Source Software Center
