9.3.9 and pg_multixact corruption

From:	Bernd Helmle <bernd(at)oopsware(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	9.3.9 and pg_multixact corruption
Date:	2015-09-10 21:26:47
Message-ID:	7E3C7F8D210AC9A423E96F3A@eje.local
Lists:	pgsql-hackers

A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
instance.

The database crashed with the following log messages:

2015-09-08 00:49:16 CEST [2912] PANIC: could not access status of
transaction 1068235595
2015-09-08 00:49:16 CEST [2912] DETAIL: Could not open file
"pg_multixact/members/FFFF5FC4": No such file or directory.
2015-09-08 00:49:16 CEST [2912] STATEMENT: delete from StockTransfer
where oid = $1 and tanum = $2

When they called us later, it turned out that the crash had happened during
a base backup, leaving behind a backup_label that prevented the database
from coming up again: startup failed with an invalid checkpoint location.
Removing the backup_label did not get the database through recovery either.
It failed again with the former error, this time during recovery:

2015-09-08 11:40:04 CEST [27047] LOG: database system was interrupted
while in recovery at 2015-09-08 11:19:44 CEST
2015-09-08 11:40:04 CEST [27047] HINT: This probably means that some data
is corrupted and you will have to use the last backup for recovery.
2015-09-08 11:40:04 CEST [27047] LOG: database system was not properly
shut down; automatic recovery in progress
2015-09-08 11:40:05 CEST [27047] LOG: redo starts at 1A52/2313FEF8
2015-09-08 11:40:47 CEST [27082] FATAL: the database system is starting up
2015-09-08 11:40:59 CEST [27047] FATAL: could not access status of
transaction 1068235595
2015-09-08 11:40:59 CEST [27047] DETAIL: Could not seek in file
"pg_multixact/members/FFFF5FC4" to offset 4294950912: Invalid argument.
2015-09-08 11:40:59 CEST [27047] CONTEXT: xlog redo create mxid 1068235595
offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh)
2015-09-08 11:40:59 CEST [27045] LOG: startup process (PID 27047) exited
with exit code 1
2015-09-08 11:40:59 CEST [27045] LOG: aborting startup due to startup
process failure
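
One possible reading of the numbers in these messages, sketched below in Python: the redo record reports a members offset of 2147483648 (2^31), and if that offset is treated as a signed 32-bit integer somewhere in the page/segment arithmetic, the result is exactly the file name and seek position shown in the log. The constants are assumptions taken from a default PostgreSQL build (8192-byte blocks, 1636 multixact members per page, 32 SLRU pages per segment) and are not stated in this report. The helper names are hypothetical, and this is only arithmetic that matches the log output, not a confirmed diagnosis.

```python
# Assumed constants from a default PostgreSQL build; not stated in the report.
BLCKSZ = 8192              # bytes per SLRU page
MEMBERS_PER_PAGE = 1636    # multixact members stored on one page
PAGES_PER_SEGMENT = 32     # SLRU pages per segment file


def to_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def c_div(a, b):
    """Integer division that truncates toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def c_mod(a, b):
    """Remainder matching C's truncating division."""
    return a - b * c_div(a, b)


def locate_member(offset, signed):
    """Return (segment file name, seek offset) for a members offset.

    Both results are shown as unsigned 32-bit values, which is how they
    appear in the log messages.
    """
    value = to_int32(offset) if signed else offset
    page = c_div(value, MEMBERS_PER_PAGE)
    segment = c_div(page, PAGES_PER_SEGMENT)
    byte_offset = c_mod(page, PAGES_PER_SEGMENT) * BLCKSZ
    return "%04X" % (segment & 0xFFFFFFFF), byte_offset & 0xFFFFFFFF


if __name__ == "__main__":
    offset = 2147483648  # "offset 2147483648" from the redo CONTEXT line
    print("unsigned:", locate_member(offset, signed=False))
    print("signed:  ", locate_member(offset, signed=True))
```

With the offset read as unsigned, the sketch yields segment A03C at byte 16384. Read as signed, it yields segment FFFF5FC4 and seek offset 4294950912, the same values as in the DETAIL lines above.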

Some side notes:

An additional recovery from a base backup plus archive recovery led to the
same error as soon as the affected tuple was touched with a DELETE. The
affected table was fully dumpable via pg_dump, though.

We also have a core dump, but no direct access to the machine. If more
information is required (and I believe it is), let me know where to dig
deeper. I would also like to request a backtrace from the existing core
dump, but since we have no sparc64 machine here, we need to ask the
customer to get one.

--
Thanks

Bernd

Responses

Re: 9.3.9 and pg_multixact corruption at 2015-09-10 21:39:32 from Alvaro Herrera
Re: 9.3.9 and pg_multixact corruption at 2015-09-10 22:35:34 from Alvaro Herrera
Re: 9.3.9 and pg_multixact corruption at 2015-09-10 22:45:46 from Alvaro Herrera
Re: 9.3.9 and pg_multixact corruption at 2015-09-11 12:25:39 from Christoph Berg