From: | Chris Travers <chris(dot)travers(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Funny WAL corruption issue |
Date: | 2017-08-10 12:09:53 |
Message-ID: | CAKt_ZfvqM8BmxnW6xV0RHDghYaspm0Lv=GOvN6t4jRdvgDEVrw@mail.gmail.com |
Lists: | pgsql-hackers |
Hi;
I ran into a funny situation today regarding PostgreSQL replication and WAL
corruption, and wanted to go over what I think happened and what might be a
possible solution.
Basic information: a custom-built PostgreSQL 9.6.3 on Gentoo, with a ~5TB
database under variable load. The master has two slaves and generates
10-20MB of WAL traffic per second. The data_checksums option is off.
The problem occurred when I attempted to restart the service on the slave
using pg_ctl (I believe the service had been started with SysV init
scripts). On trying to restart, it gave me a nice "invalid memory alloc
request size" error and promptly stopped.
Before the restart, the main logs showed a lot of messages like these:
2017-08-02 11:47:33 UTC LOG: PID 19033 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG: PID 19032 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG: PID 19024 in cancel request did not match any process
2017-08-02 11:47:33 UTC LOG: PID 19034 in cancel request did not match any process
On restart, the following was logged to stderr:
LOG: entering standby mode
LOG: redo starts at 1E39C/8B77B458
LOG: consistent recovery state reached at 1E39C/E1117FF8
FATAL: invalid memory alloc request size 3456458752
LOG: startup process (PID 18167) exited with exit code 1
LOG: terminating any other active server processes
LOG: database system is shut down
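For reference, the size in the FATAL line can be checked against the allocator's limit. This is a minimal sketch, assuming the stock PostgreSQL limit MaxAllocSize of 0x3fffffff bytes (1 GB minus 1) from memutils.h; the helper name is made up for illustration:

```python
# Assumption: MaxAllocSize as defined in PostgreSQL's memutils.h (1 GB - 1).
MAX_ALLOC_SIZE = 0x3FFFFFFF


def exceeds_max_alloc(request_size: int) -> bool:
    """Return True if a single allocation of this size is over the limit."""
    return request_size > MAX_ALLOC_SIZE


requested = 3456458752  # size reported in the FATAL message above
print(hex(requested), exceeds_max_alloc(requested))
```

A request this far over the limit would be consistent with the startup process reading a garbage length from a damaged record, though the log alone does not prove that.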
After some troubleshooting I found that the WAL segment had become corrupt.
I copied the correct one from the master and everything caught up to the
present. So it seems that something crashed badly on the back-end, and
when we tried to restart, the WAL ended in an invalid way.
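For anyone retracing this, here is a rough sketch of how an LSN from the log maps to a segment file name, plus a checksum helper for comparing the slave's copy of a segment with the master's. It assumes the default 16MB segment size (a custom build could differ) and timeline 1, which the log above does not show; the function names are my own:

```python
import hashlib

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16MB segments


def wal_segment_name(lsn: str, timeline: int = 1) -> str:
    """Map an LSN such as '1E39C/E1117FF8' to its WAL segment file name.

    The timeline default of 1 is an assumption; use the server's real one.
    """
    hi_str, lo_str = lsn.split("/")
    position = (int(hi_str, 16) << 32) | int(lo_str, 16)
    segments_per_xlogid = 0x100000000 // WAL_SEGMENT_SIZE
    segno = position // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


def file_sha256(path: str) -> str:
    """Hash a file in 1MB chunks so large segments are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Segment holding the "consistent recovery state reached" LSN from the log.
print(wal_segment_name("1E39C/E1117FF8"))
```

Comparing file_sha256() of the same segment on both hosts is only meaningful for a segment that is complete on both sides; a segment still being streamed to the slave can legitimately differ past the last received position.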
I am wondering what can be done to prevent these sorts of things from
happening in the future if, for example, a replica dies in the middle of a
WAL fsync.
--
Best Wishes,
Chris Travers
Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more