Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

From: "lichuancheng(at)highgo(dot)com" <lichuancheng(at)highgo(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.
Date: 2019-01-09 08:28:51
Message-ID: 2019010916285112831844@highgo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


>So the take of the patch is to make sure that the WAL receiver has
>been correctly started if the startup process requested one before
>trying to replay more WAL records. Now it seems to me that you do not
>close all the gaps here. The WAL receiver could finish starting, but
>still not have the time to stream up to the point where the startup
>process could consume things safely. The patch makes somewhat sure
>that the gap is closed by applying a timeout of 1s, still in
>environments with more latency it would make the failure still
>happen. Wouldn't it be better instead to make sure that the startup
>process does not try to use once again the same record where the
>failure happened before the WAL receiver confirms that it has flushed
>something newer than the LSN position which was known as bad?

>One problem may be that upon restart we reset receivedUpto, and that
>visibly we may want to be more careful with its handling to avoid
>successive failures so as the startup process does not bump
>repeatitively into the same bogus record. Looking at the code, the
>computation looks quite safe though (I need to double-check to be 100%
>sure), so isn't the origin of the problem just that the WAL segment
>from the primary is corrupted, and that what gets streamed is broken
>at its origin point, preventing the standby to move on? You may want
>to double-check that as well..

I known what you describe.
Nomally,when the reciver process restart,the code reset redo LSN to the start of the
wal segment. But now the question is that the reciver process can not restart
sucessful,so correct wal record nerver arrived. so the startup process is in a
infinite cycle to start and end the reciver process.
So the way to solve the problem is give some time to reciver process to
restart. As long as it restart sucessful,it will apply wal record at the start of the
wal segment from master and the correct wal record will arrived. so i do not
think what you say 'still in environments with more latency it would make the
failure still happen' is right.So i think the patch can solve the problem after
modify as side note blow.

I have make sure that the the broken wal record is broked during stream,
it's well in master and other slave.


>As a side note, the patch is incorrect for a couple of things though:
>1) Formatting of comments and indentation.
>2) Please be careful of whitespaces.
>3) Usually no more than 72-80 characters per line.
>4) Why a wait phase of 1s?
I think 1s is enought for reciver process to restart. Maybe it need a discuss.

>5) And much worse, the WAL receiver status cannot be looked at without
>taking its mutex lock first, so this makes the logic look at something
>which may be concurrently updated. Oops.
I am sorry.

>6) Still the patch is incorrect per the reasons above ;)
I am new here, thanks for your explanation in side note.

-----
movead

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andrew Gierth 2019-01-09 09:26:47 Re: Unwanted effect of search_path on default-value entries in case of serial columns
Previous Message Holger Jakobs 2019-01-09 08:19:59 Unwanted effect of search_path on default-value entries in case of serial columns