Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.

From: Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
To: lichuancheng(at)highgo(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.
Date: 2018-12-29 03:59:34
Message-ID: CAK-MWwTYxLo9M5tzGToZeNj6YLKCcBhNETnGTbLexfj53RsW-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Dec 28, 2018 at 8:10 PM PG Bug reporting form <
noreply(at)postgresql(dot)org> wrote:

> The following bug has been logged on the website:
>
> Bug reference: 15567
> Logged by: movead
> Email address: lichuancheng(at)highgo(dot)com
> PostgreSQL version: 10.5
> Operating system: centos7
> Description:
>
> [Problem description]
> We use pg9.5.7 with replica and i find that if some damaged wal record
> arrived at standby then the startup process on standby try to restart
> walreceiver process but never sucess and log below Infinite cycle on
> standby.If i try to restart database now,it sucess.
> (I try it on pg10.5,the same result.)
>
> -----------------------------------------------------------------------------------------------------------------
> 2018-12-15 10:38:00.302 CST,,,30928,,5c1422d5.78d0,2,,2018-12-15 05:38:29
> CST,,0,FATAL,57P01,"terminating walreceiver process due to administrator
> command",,,,,,,,,""
> 2018-12-15 10:38:00.302 CST,,,5546,,5c1276c1.15aa,8,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:00.302 CST,,,5546,,5c1276c1.15aa,9,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:00.302 CST,,,5546,,5c1276c1.15aa,10,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:05.308 CST,,,5546,,5c1276c1.15aa,11,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:10.314 CST,,,5546,,5c1276c1.15aa,12,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:15.319 CST,,,5546,,5c1276c1.15aa,13,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:20.325 CST,,,5546,,5c1276c1.15aa,14,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:25.331 CST,,,5546,,5c1276c1.15aa,15,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:30.336 CST,,,5546,,5c1276c1.15aa,16,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:35.342 CST,,,5546,,5c1276c1.15aa,17,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:40.348 CST,,,5546,,5c1276c1.15aa,18,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:45.353 CST,,,5546,,5c1276c1.15aa,19,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:50.359 CST,,,5546,,5c1276c1.15aa,20,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:38:55.364 CST,,,5546,,5c1276c1.15aa,21,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
> 2018-12-15 10:39:00.370 CST,,,5546,,5c1276c1.15aa,22,,2018-12-13 23:12:01
> CST,1/0,0,LOG,00000,"incorrect resource manager data checksum in record at
> 5A9/11E0A90",,,,,,,,,""
>
> -----------------------------------------------------------------------------------------------------------------
>
>
> [Code review]
> I read the code about startup process and walreciver process.
>
> When a damaged wal record arrived:
> 1.The startup process shutdown the walreciver process use function
> ShutdownWalRcv().
> 2.The startup process start the walreciver process use funtion
> RequestXLogStreaming() by signal PMSIGNAL_START_WALRECEIVER.
> 3.The startup process try to read wal record.
> And now reciver process does not startup completely,so the startup
> process
> read the damaged wal record another times.
> And startup process set walrcv->walRcvState = WALRCV_STOPPED use
> function
> ShutdownWalRcv().
> 4.go to 2.
> That's the infinite cycle.
>
> [Question]
> I wonder that if it is a bug or you just don't want restart walreciver
> process.
>
>
It look like the same issue as I reported in BUG #15151.
https://www.postgresql.org/message-id/flat/152353560977.31233.11406243320876395727%40wrigleys.postgresql.org
(unfortunately I didn't have time to explore it further then).

You have good description of what's going on under hood.

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message lichuancheng@highgo.com 2018-12-29 05:08:18 Re: Re: BUG #15567: Wal receiver process restart failed when a damaged wal record arrived at standby.
Previous Message PG Bug reporting form 2018-12-28 10:26:04 BUG #15568: pg_dump error