From: | Chris Travers <chris(dot)travers(at)gmail(dot)com> |
---|---|
To: | Vladimir Borodin <root(at)simply(dot)name> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Funny WAL corruption issue |
Date: | 2017-08-10 12:28:59 |
Message-ID: | CAKt_ZfvvsufN-Hv86bTCEd=t1dKLUcFDhA3oBCKTGyLW1eUhRA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Sorry, meant to reply all.
On Thu, Aug 10, 2017 at 2:19 PM, Vladimir Borodin <root(at)simply(dot)name> wrote:
> Hi, Chris.
>
> 10 авг. 2017 г., в 15:09, Chris Travers <chris(dot)travers(at)gmail(dot)com>
> написал(а):
>
> Hi;
>
> I ran into a funny situation today regarding PostgreSQL replication and
> wal corruption and wanted to go over what I think happened and what I
> wonder about as a possible solution.
>
> Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB
> database with variable load. Master database has two slaves and generates
> 10-20MB of WAL traffic a second. The data_checksum option is off.
>
>
> The problem occurred when I attempted to restart the service on the slave
> using pg_ctl (I believe the service had been started with sys V init
> scripts). On trying to restart, it gave me a nice "Invalid memory
> allocation request" error and promptly stopped.
>
> The main logs showed a lot of messages like before the restart:
> 2017-08-02 11:47:33 UTC LOG: PID 19033 in cancel request did not match
> any process
> 2017-08-02 11:47:33 UTC LOG: PID 19032 in cancel request did not match
> any process
> 2017-08-02 11:47:33 UTC LOG: PID 19024 in cancel request did not match
> any process
> 2017-08-02 11:47:33 UTC LOG: PID 19034 in cancel request did not match
> any process
>
> On restart, the following was logged to stderr:
> LOG: entering standby mode
> LOG: redo starts at 1E39C/8B77B458
> LOG: consistent recovery state reached at 1E39C/E1117FF8
> FATAL: invalid memory alloc request size 3456458752
> LOG: startup process (PID 18167) exited with exit code 1
> LOG: terminating any other active server processes
>
> LOG: database system is shut down
>
> After some troubleshooting I found that the wal segment had become
> corrupt, I copied the correct one from the master and everything came up to
> present.
>
> So It seems like somewhere something crashed big time on the back-end and
> when we tried to restart, the wal ended in an invalid way.
>
>
> We have reported the same thing [1] nearly a year ago. Could you please
> check with pg_xlogdump that both WALs (normal from master and corrupted)
> are exactly the same until some certain LSN?
>
> [1] https://www.postgresql.org/message-id/20160614103415.
> 5796.6885%40wrigleys.postgresql.org
>
Yes. Exactly the same output until a certain point where pg_xlogdump dies
with an error. That is the same LSN where PostgreSQL died with an error on
restart.
>
>
> I am wondering what can be done to prevent these sorts of things from
> happening in the future if, for example, a replica dies in the middle of a
> wal fsync.
> --
> Best Wishes,
> Chris Travers
>
> Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
> lock-in.
> http://www.efficito.com/learn_more
>
>
>
> --
> May the force be with you…
> https://simply.name
>
>
--
Best Wishes,
Chris Travers
Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more
From | Date | Subject | |
---|---|---|---|
Next Message | Ashutosh Sharma | 2017-08-10 12:30:00 | Re: pl/perl extension fails on Windows |
Previous Message | Etsuro Fujita | 2017-08-10 12:25:21 | Comment typo in plannodes.h |