Re: ERROR: MultiXactId xxxx has not been created yet -- apparent wraparound

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Paul Smith <paul(at)pscs(dot)co(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ERROR: MultiXactId xxxx has not been created yet -- apparent wraparound
Date: 2015-05-26 18:47:36
Message-ID: 20150526184736.GX5885@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Paul Smith wrote:

> No, nothing like that. It was just running fine, and then suddenly (at 2am
> on 23 May) it started throwing up loads of these errors. The DB server
> wasn't even restarted at that point. It was just working fine, then suddenly
> wasn't. (The first error was at 02:00:32 BST, then every few minutes after
> that there's another one).

Another crazy thought is that the host system messed up and overwrote
part of the table with pages from some other guest system. I have never
seen a report of such a thing happening.

> It's running in a Hyper-V guest. We had taken a backup of the VM at 00:34 on
> 23 May and that looks to be absolutely fine. What I have done now is restore
> that backup and import the new data which arrived since that backup was
> made, and it seems OK now. I still have the 'broken' installation in case
> more information is needed from it. I'd try to get a raw dump of the damaged
> tuple data if I knew how to find where it is in the relation file...

Well, you could try a binary search to locate the (or a) borked page
within the table. Or you could write a plpgsql function with an
EXCEPTION block that reads each tuple, and in case of error reports the
CTID of the tuple being read. (select * from tab where
ctid=generated_ctid; EXCEPT WHEN OTHERS THEN raise notice ' .. ',
generated_ctid). Once you have located the problem tuple, a pageinspect
dump would be the starting point.

Another idea is to attach a debugger to the backend process, and set a
breakpoint on function errfinish. Once the error is hit the breakpoint
will stop the execution before the error is reported; at that point you
can request for a backtrace from the debugger. (Make sure to examine
errordata to see whether the error being hit is the multixact one.)

> I suppose it's possible that it was disk or memory corruption, but I've seen
> that before, and it hasn't looked like this.

Yeah, same here.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-05-26 18:55:15 Re: Run pgindent now?
Previous Message Josh Berkus 2015-05-26 18:20:58 Re: fsync bug faq for publication?