Re: 9.3.9 and pg_multixact corruption

From: Christoph Berg <christoph(dot)berg(at)credativ(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.3.9 and pg_multixact corruption
Date: 2015-09-11 12:25:39
Message-ID: 20150911122538.GA2672@msg.df7cb.de
Lists: pgsql-hackers

Re: Bernd Helmle 2015-09-10 <7E3C7F8D210AC9A423E96F3A(at)eje(dot)local>
> 2015-09-08 11:40:59 CEST [27047] DETAIL: Could not seek in file
> "pg_multixact/members/FFFF5FC4" to offset 4294950912: Invalid argument.
> 2015-09-08 11:40:59 CEST [27047] CONTEXT: xlog redo create mxid 1068235595
> offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh)
> 2015-09-08 11:40:59 CEST [27045] LOG: startup process (PID 27047) exited
> with exit code 1
> 2015-09-08 11:40:59 CEST [27045] LOG: aborting startup due to startup
> process failure
>
> Some side notes:
>
> An additional recovery from a base backup and archive recovery yield to the
> same error, as soon as the affected tuple was touched with a DELETE. The
> affected table was fully dumpable via pg_dump, though.
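
The numbers in the quoted log line can be cross-checked with a little
arithmetic. The sketch below is only an observation about the figures, not
a diagnosis and not a claim about what the server code does. It assumes the
default 8 kB block size, 32 SLRU pages per segment file, and 1636 multixact
members per page (the value I believe applies to 9.3 with 8 kB pages). The
helper names are made up for illustration. Under those assumptions, treating
the member offset 2147483648 as a signed 32-bit integer and using C-style
truncating division gives the file name and seek offset from the log.

```python
BLCKSZ = 8192                  # assumed default block size
SLRU_PAGES_PER_SEGMENT = 32    # assumed SLRU segment size in pages
MEMBERS_PER_PAGE = 1636        # assumed members per page for 8 kB blocks


def to_int32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def c_divmod(a, b):
    """Division truncating toward zero, as C does (Python's // floors)."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q, a - q * b


def locate(member_offset):
    """Hypothetical helper: map a member offset to (file name, seek offset)."""
    signed = to_int32(member_offset)
    page, _ = c_divmod(signed, MEMBERS_PER_PAGE)
    segment, page_in_segment = c_divmod(page, SLRU_PAGES_PER_SEGMENT)
    # Print both results the way an unsigned 32-bit format would show them.
    file_name = "%04X" % (segment & 0xFFFFFFFF)
    seek_offset = (page_in_segment * BLCKSZ) & 0xFFFFFFFF
    return file_name, seek_offset


print(locate(2147483648))
```

With the unsigned value the same calculation lands in segment A03C instead,
so the file name in the log at least looks like a sign problem at offset
2^31. Whether that is the actual cause is not established here.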

A few more words here: the archive recovery was a PITR to 00:45, well
before the problem occurred. The cluster initially worked well, but
crashed shortly afterwards with the same mxid 1068235595 message. That
crash was triggered by a DELETE on a different table (which was
related schema-wise, but IIRC neither of these tables has any FKs).

We then rewound the system to a ZFS snapshot taken when the archive
recovery had finished (db shut down cleanly) and brought it up again,
whereupon it crashed again with mxid 1068235595, this time on a third
table.

The original crash and the first post-recovery crash happened a few
minutes after pg_start_backup(), though the next crash happened without
one.

(While the archive recovery was running, I ran pg_resetxlog on the
original cluster. It was possible to isolate the ctid of an affected
tuple, but it wasn't possible to DELETE it: the attempt yielded an
error message similar to the above, though the database kept running.
I then zeroed the bad block using dd (zero_damaged_pages didn't help),
only to find that at least one more tuple in that table was affected
(with a different mxid).)
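
For anyone wanting to locate a block from a ctid before reaching for dd,
here is a rough sketch of the offset arithmetic. It assumes the default
8 kB block size and 1 GB relation segment files (131072 blocks each). The
function name and the example block number are hypothetical, not the values
from this incident. It only computes numbers and touches no files.

```python
BLCKSZ = 8192            # assumed default block size
RELSEG_SIZE = 131072     # assumed blocks per 1 GB segment file


def block_location(block_number):
    """Hypothetical helper: where a heap block lives on disk.

    Returns (segment number, block within segment, byte offset within
    that segment file). Segment 0 is the bare relfilenode file; segment
    N > 0 is the file with the ".N" suffix.
    """
    segment = block_number // RELSEG_SIZE
    block_in_segment = block_number % RELSEG_SIZE
    return segment, block_in_segment, block_in_segment * BLCKSZ


# The first component of a ctid such as (200000,7) is the block number.
print(block_location(200000))
```

The block within the segment is what would go into dd's seek= with bs=8192,
together with conv=notrunc so the rest of the file is left alone. Doing
this on anything but a copy of a stopped cluster is asking for more trouble.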

Christoph
