From: | Bernd Helmle <bernd(at)oopsware(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | 9.3.9 and pg_multixact corruption |
Date: | 2015-09-10 21:26:47 |
Message-ID: | 7E3C7F8D210AC9A423E96F3A@eje.local |
Lists: | pgsql-hackers |
A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
instance.
The database crashed with the following log messages:
2015-09-08 00:49:16 CEST [2912] PANIC: could not access status of
transaction 1068235595
2015-09-08 00:49:16 CEST [2912] DETAIL: Could not open file
"pg_multixact/members/FFFF5FC4": No such file or directory.
2015-09-08 00:49:16 CEST [2912] STATEMENT: delete from StockTransfer
where oid = $1 and tanum = $2
When they called us later, it turned out that the crash happened during a
base backup, leaving a backup_label behind which prevented the database
from coming up again, failing with an invalid checkpoint location. However,
removing the backup_label still didn't let the database get through
recovery; it failed again with the former error, this time during recovery:
2015-09-08 11:40:04 CEST [27047] LOG: database system was interrupted
while in recovery at 2015-09-08 11:19:44 CEST
2015-09-08 11:40:04 CEST [27047] HINT: This probably means that some data
is corrupted and you will have to use the last backup for recovery.
2015-09-08 11:40:04 CEST [27047] LOG: database system was not properly
shut down; automatic recovery in progress
2015-09-08 11:40:05 CEST [27047] LOG: redo starts at 1A52/2313FEF8
2015-09-08 11:40:47 CEST [27082] FATAL: the database system is starting up
2015-09-08 11:40:59 CEST [27047] FATAL: could not access status of
transaction 1068235595
2015-09-08 11:40:59 CEST [27047] DETAIL: Could not seek in file
"pg_multixact/members/FFFF5FC4" to offset 4294950912: Invalid argument.
2015-09-08 11:40:59 CEST [27047] CONTEXT: xlog redo create mxid 1068235595
offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh)
2015-09-08 11:40:59 CEST [27045] LOG: startup process (PID 27047) exited
with exit code 1
2015-09-08 11:40:59 CEST [27045] LOG: aborting startup due to startup
process failure
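The file name and seek offset in the log lines above can be reproduced
arithmetically from the member offset 2147483648 (2^31) shown in the redo
context, if that offset is treated as a signed 32-bit value. The Python
sketch below only illustrates that arithmetic. It is not PostgreSQL code and
it does not say where such a conversion would happen. The constants (8192-byte
pages, 1636 members per page, 32 pages per segment file) are assumptions for
a default build, and all helper names are hypothetical.

```python
# Assumed layout constants for a default build; not taken from the report.
BLCKSZ = 8192             # assumption: page size in bytes
MEMBERS_PER_PAGE = 1636   # assumption: (8192 // 20) groups * 4 members each
PAGES_PER_SEGMENT = 32    # assumption: pages per segment file


def to_int32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def c_div(a, b):
    """Integer division that truncates toward zero, as C does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def c_mod(a, b):
    """Remainder matching c_div, so the sign follows the dividend."""
    return a - b * c_div(a, b)


def locate_member(offset, signed):
    """Return (segment file name, byte offset in file) for a member offset."""
    value = to_int32(offset) if signed else offset
    page = c_div(value, MEMBERS_PER_PAGE)
    segment = c_div(page, PAGES_PER_SEGMENT)
    page_in_segment = c_mod(page, PAGES_PER_SEGMENT)
    # Render both results as unsigned 32-bit values, as they appear in the log.
    file_name = "%04X" % (segment & 0xFFFFFFFF)
    seek_offset = (page_in_segment * BLCKSZ) & 0xFFFFFFFF
    return file_name, seek_offset


print(locate_member(2147483648, signed=True))   # ('FFFF5FC4', 4294950912)
print(locate_member(2147483648, signed=False))  # ('A03C', 16384)
```

With the signed interpretation the sketch gives segment FFFF5FC4 and offset
4294950912, the same values as in the DETAIL lines. With the unsigned
interpretation the same member offset would map to segment A03C at offset
16384.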
Some side notes:
An additional recovery from a base backup with archive recovery yielded the
same error as soon as the affected tuple was touched with a DELETE. The
affected table was fully dumpable via pg_dump, though.
We also have a core dump, but no direct access to the machine. If there's
more information required (and I believe it is), let me know where to dig
deeper. I also would like to request a backtrace from the existing core
dump, but in the absence of a sparc64 machine here we need to ask the
customer to get one.
--
Thanks
Bernd