|From:||Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>|
|To:||Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>|
|Cc:||Bernd Helmle <bernd(at)oopsware(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>|
|Subject:||Re: 9.3.9 and pg_multixact corruption|
On Fri, Sep 11, 2015 at 10:45 AM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Bernd Helmle wrote:
> > A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
> > instance.
> > The database crashed with the following log messages:
> > 2015-09-08 00:49:16 CEST  PANIC: could not access status of
> > transaction 1068235595
> > 2015-09-08 00:49:16 CEST  DETAIL: Could not open file
> > "pg_multixact/members/FFFF5FC4": No such file or directory.
> > 2015-09-08 00:49:16 CEST  STATEMENT: delete from StockTransfer
> > where oid = $1 and tanum = $2
> I wonder if these bogus page and offset numbers are just
> SlruReportIOError being confused because pg_multixact/members is so
> weird (I don't think it should be the case, since this stuff is using
> page numbers only, not anything related to how each page is layed out).
But SlruReportIOError uses the same macro to build the filename as
SlruReadPhysicalPage and other functions, namely SlruFileName which uses
sprintf with %04X (unsigned integer uppercase hex) and gives it segno
(which is always an int), so I don't think the problem is in error
reporting.
Assuming default block size, to get FFFF5FC4 from SlruFileName you need
segno == -41020.
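The filename arithmetic can be checked in isolation. Below is a minimal sketch, assuming a 32-bit int. The helpers format_segno and parse_segno are hypothetical names for illustration, not PostgreSQL functions. They only mimic the %04X conversion described above and its inverse.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helpers for illustration; not PostgreSQL functions. */

/* Format a segment number with the %04X conversion described above. */
static int
format_segno(char *buf, size_t buflen, int segno)
{
    assert(buf != NULL && buflen >= 9);   /* 8 hex digits plus NUL */
    /* %X expects unsigned int, so make the reinterpretation explicit. */
    return snprintf(buf, buflen, "%04X", (unsigned int) segno);
}

/* Recover the signed 32-bit segment number from a hex segment name. */
static int32_t
parse_segno(const char *name)
{
    uint32_t bits = (uint32_t) strtoul(name, NULL, 16);

    /* Map the upper half of the unsigned range back to negative values. */
    if (bits > (uint32_t) INT32_MAX)
        return (int32_t) ((int64_t) bits - INT64_C(4294967296));
    return (int32_t) bits;
}
```

Calling parse_segno("FFFF5FC4") should give back -41020, and format_segno with -41020 should reproduce the name from the log.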
We have int segno = pageno / 32 (that's SLRU_PAGES_PER_SEGMENT), so to get
segno == -41020 you need pageno between -1312671 and -1312640, since C
integer division truncates toward zero (their bit patterns reinterpreted
as unsigned are 4293654625 and 4293654656).
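The exact bounds depend on how integer division rounds for negative operands. C99 and later truncate toward zero, so the pages belonging to a negative segment sit at and below segno * 32. Here is a small sketch to check that. page_range_for_segno is a hypothetical helper, and the constant 32 is the SLRU_PAGES_PER_SEGMENT value quoted above.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_SEGMENT 32    /* SLRU_PAGES_PER_SEGMENT, as quoted above */

/* Same arithmetic as "int segno = pageno / 32". */
static int
segno_for_page(int pageno)
{
    return pageno / PAGES_PER_SEGMENT;
}

/*
 * Hypothetical helper: inclusive range of page numbers that map to segno,
 * assuming division that truncates toward zero (guaranteed since C99).
 */
static void
page_range_for_segno(int segno, int *first, int *last)
{
    assert(first != NULL && last != NULL);

    if (segno > 0)
    {
        *first = segno * PAGES_PER_SEGMENT;
        *last = *first + PAGES_PER_SEGMENT - 1;
    }
    else if (segno < 0)
    {
        *last = segno * PAGES_PER_SEGMENT;
        *first = *last - (PAGES_PER_SEGMENT - 1);
    }
    else
    {
        /* Truncation maps every page in -31..31 to segment 0. */
        *first = -(PAGES_PER_SEGMENT - 1);
        *last = PAGES_PER_SEGMENT - 1;
    }
}
```

For segno == -41020 this yields the inclusive range -1312671 to -1312640. The neighbouring values -1312672 and -1312639 fall into segments -41021 and -41019 respectively.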
In various places we have int pageno = offset / (uint32) 1636, expanded
from this macro (which calls the offset an xid):
#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId)
I don't really see how any uint32 value could produce such a pageno via
that macro. Even if called in an environment where (xid) is accidentally
an int, the int / unsigned expression would convert it to unsigned first
(unless (xid) is a bigger type like int64_t: by the rules of int promotion
you'd get signed division in that case, hmm...). But it's always called
with a MultiXactOffset AKA uint32 variable.
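To make the conversion rules concrete, here is a sketch of the three cases, assuming a 32-bit int and the divisor 1636 quoted above. The function names are hypothetical. Each one only repeats the division with a different argument type.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define MEMBERS_PER_PAGE 1636   /* divisor quoted above, default block size */

/* The normal case: a uint32 offset, as with MultiXactOffset. */
static int
page_from_uint32(uint32_t offset)
{
    return (int) (offset / (uint32_t) MEMBERS_PER_PAGE);
}

/* An int argument is converted to unsigned before the division. */
static int
page_from_int(int offset)
{
    assert(sizeof(int) * CHAR_BIT == 32);   /* sketch assumes 32-bit int */
    return (int) (offset / (uint32_t) MEMBERS_PER_PAGE);
}

/* A wider signed argument keeps its sign, so the division is signed. */
static int64_t
page_from_int64(int64_t offset)
{
    return offset / (uint32_t) MEMBERS_PER_PAGE;
}
```

With these, page_from_int(-1) gives the same non-negative page as page_from_uint32(UINT32_MAX). Only the int64_t variant can return a negative page number.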
So via that route, there is no MultiXactOffset value that can't be mapped
to a segment in the range "0000" to "14078". Famously, it wraps after that.
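The upper bound can be checked the same way. This sketch assumes the default block size values quoted earlier (1636 members per page, 32 pages per segment). The helper names are hypothetical, not PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MEMBERS_PER_PAGE  1636  /* divisor quoted above, default block size */
#define PAGES_PER_SEGMENT 32    /* SLRU_PAGES_PER_SEGMENT */

/* Offset to page to segment, using unsigned division for the first step. */
static int
segno_for_offset(uint32_t offset)
{
    int pageno = (int) (offset / (uint32_t) MEMBERS_PER_PAGE);

    return pageno / PAGES_PER_SEGMENT;
}

/* Format the segment name with the same %04X conversion. */
static int
segment_name_for_offset(char *buf, size_t buflen, uint32_t offset)
{
    assert(buf != NULL && buflen >= 9);   /* up to 8 hex digits plus NUL */
    return snprintf(buf, buflen, "%04X",
                    (unsigned int) segno_for_offset(offset));
}
```

The largest uint32 offset lands in segment 0x14078, and adding one wraps the offset back to 0, which maps to segment "0000" again.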
Maybe the negative pageno came from somewhere else. Where? Inside SLRU
code we can see pageno = shared->page_number[slotno]... maybe the SLRU
slots got corrupted somehow?