Re: txid failed epoch increment, again, aka 6291

From: Daniel Farina <daniel(at)heroku(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: txid failed epoch increment, again, aka 6291
Date: 2012-09-07 08:37:57
Message-ID: CAAZKuFbDRuvL7i5_wheWYud7yFf69Nmnq+0XTBfTCFyR0B_gAw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
>> I might try to find the segments leading up to the overflow point and
>> try xlogdumping them to see what we can see.
>
> That would be helpful to see.
>
> Just to grasp at yet-flimsier straws, could you post (URL preferred, else
> private mail) the output of "objdump -dS" on your "postgres" executable?

https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz

Sure, it's a 9.0.6 with pg_cancel_backend by-same-role backported
along with the standard debian changes, so nothing all that
interesting should be going on that isn't going on normally with
compilers on this platform. I am also starting to grovel through this
assembly, although I don't have a ton of experience finding problems
this way.

To save you a tiny bit of time aligning the assembly with the C, this line

c797f: e8 7c c9 17 00 callq 244300 <LWLockAcquire>

Seems to be the beginning of:

LWLockAcquire(XidGenLock, LW_SHARED);
checkPoint.nextXid = ShmemVariableCache->nextXid;
checkPoint.oldestXid = ShmemVariableCache->oldestXid;
checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
LWLockRelease(XidGenLock);

>> If there's anything to note about the workload, I'd say that it does
>> tend to make fairly pervasive use of long running transactions which
>> can span probably more than one checkpoint, and the txid reporting
>> functions, and a concurrency level of about 300 or so backends ... but
>> per my reading of the mechanism so far, it doesn't seem like any of
>> this should matter.
>
> Thanks for the details; I agree none of that sounds suspicious.
>
> After some further pondering and testing, this remains a mystery to me. These
> symptoms imply a proper update of ControlFile->checkPointCopy.nextXid without
> having properly updated ControlFile->checkPointCopy.nextXidEpoch. After
> recovery, only CreateCheckPoint() updates ControlFile->checkPointCopy at all.
> Its logic for doing so looks simple and correct.

Yeah. I'm pretty flabbergasted that so much seems to be going right
while this goes wrong.

--
fdr

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavan Deolasee 2012-09-07 09:19:32 Re: BUG #7521: Cannot disable WAL log while using pg_dump
Previous Message Gezeala M. Bacuño II 2012-09-07 06:45:18 Re: BUG #7521: Cannot disable WAL log while using pg_dump