9.3.9 and pg_multixact corruption

From: Bernd Helmle <bernd(at)oopsware(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: 9.3.9 and pg_multixact corruption
Date: 2015-09-10 21:26:47
Message-ID: 7E3C7F8D210AC9A423E96F3A@eje.local
Lists: pgsql-hackers

A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
instance.

The database crashed with the following log messages:

2015-09-08 00:49:16 CEST [2912] PANIC: could not access status of
transaction 1068235595
2015-09-08 00:49:16 CEST [2912] DETAIL: Could not open file
"pg_multixact/members/FFFF5FC4": No such file or directory.
2015-09-08 00:49:16 CEST [2912] STATEMENT: delete from StockTransfer
where oid = $1 and tanum = $2

When they called us later, it turned out that the crash had happened during a
base backup, leaving a backup_label behind, which prevented the database from
coming up again because of an invalid checkpoint location. However, removing
the backup_label still didn't let the database get through recovery; it failed
again with the former error, this time during recovery:

2015-09-08 11:40:04 CEST [27047] LOG: database system was interrupted
while in recovery at 2015-09-08 11:19:44 CEST
2015-09-08 11:40:04 CEST [27047] HINT: This probably means that some data
is corrupted and you will have to use the last backup for recovery.
2015-09-08 11:40:04 CEST [27047] LOG: database system was not properly
shut down; automatic recovery in progress
2015-09-08 11:40:05 CEST [27047] LOG: redo starts at 1A52/2313FEF8
2015-09-08 11:40:47 CEST [27082] FATAL: the database system is starting up
2015-09-08 11:40:59 CEST [27047] FATAL: could not access status of
transaction 1068235595
2015-09-08 11:40:59 CEST [27047] DETAIL: Could not seek in file
"pg_multixact/members/FFFF5FC4" to offset 4294950912: Invalid argument.
2015-09-08 11:40:59 CEST [27047] CONTEXT: xlog redo create mxid 1068235595
offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh)
2015-09-08 11:40:59 CEST [27045] LOG: startup process (PID 27047) exited
with exit code 1
2015-09-08 11:40:59 CEST [27045] LOG: aborting startup due to startup
process failure
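
A note on the numbers in the two log excerpts, offered as a hedged reading
rather than a diagnosis: the member offset in the redo record, 2147483648, is
exactly 2^31. The sketch below assumes the default 9.3 layout (8 kB pages,
1636 multixact members per page, 32 pages per SLRU segment, segment files
named with "%04X"); those constants and the helper names are assumptions, not
part of the report. Under them, an unsigned calculation would point at segment
A03C, while treating the offset as a signed 32-bit integer reproduces both the
file name FFFF5FC4 and the seek offset 4294950912 from the log.

```python
# Assumed constants for a default 9.3 build; not taken from the report.
BLCKSZ = 8192
MEMBERS_PER_PAGE = 1636      # (8192 // 20) member groups * 4 members each
PAGES_PER_SEGMENT = 32


def to_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def c_div(a, b):
    """Integer division that truncates toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def c_mod(a, b):
    """Remainder matching C's % operator (sign follows the dividend)."""
    return a - b * c_div(a, b)


def locate(offset, signed):
    """Map a member offset to (segment file name, byte offset in file)."""
    off = to_int32(offset) if signed else offset
    page = c_div(off, MEMBERS_PER_PAGE)
    segment = c_div(page, PAGES_PER_SEGMENT)
    byte_offset = c_mod(page, PAGES_PER_SEGMENT) * BLCKSZ
    # Both values are shown as unsigned 32-bit, as they appear in the log.
    return "%04X" % (segment & 0xFFFFFFFF), byte_offset & 0xFFFFFFFF


print(locate(2147483648, signed=False))  # ('A03C', 16384)
print(locate(2147483648, signed=True))   # ('FFFF5FC4', 4294950912)
```

This only shows that the logged values are consistent with a sign problem at
the 2^31 boundary; it says nothing about where in the server such a
conversion might happen.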

Some side notes:

An additional restore from a base backup followed by archive recovery yielded
the same error as soon as the affected tuple was touched by a DELETE. The
affected table was fully dumpable via pg_dump, though.

We also have a core dump, but no direct access to the machine. If more
information is required (and I believe it is), let me know where to dig
deeper. I would also like to get a backtrace from the existing core dump, but
in the absence of a sparc64 machine here we need to ask the customer for one.

--
Thanks

Bernd
