Re: txid failed epoch increment, again, aka 6291

From:	Daniel Farina <daniel(at)heroku(dot)com>
To:	Noah Misch <noah(at)leadboat(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: txid failed epoch increment, again, aka 6291
Date:	2012-09-07 08:37:57
Message-ID:	CAAZKuFbDRuvL7i5_wheWYud7yFf69Nmnq+0XTBfTCFyR0B_gAw@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
>> I might try to find the segments leading up to the overflow point and
>> try xlogdumping them to see what we can see.
>
> That would be helpful to see.
>
> Just to grasp at yet-flimsier straws, could you post (URL preferred, else
> private mail) the output of "objdump -dS" on your "postgres" executable?

https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz

Sure. It's a 9.0.6 build with the pg_cancel_backend by-same-role change
backported, along with the standard Debian changes, so nothing all that
interesting should be going on that isn't going on normally with
compilers on this platform. I am also starting to grovel through this
assembly, although I don't have a ton of experience finding problems
this way.

To save you a tiny bit of time aligning the assembly with the C, this line

c797f: e8 7c c9 17 00 callq 244300 <LWLockAcquire>

seems to be the beginning of:

LWLockAcquire(XidGenLock, LW_SHARED);
checkPoint.nextXid = ShmemVariableCache->nextXid;
checkPoint.oldestXid = ShmemVariableCache->oldestXid;
checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
LWLockRelease(XidGenLock);
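
As a cross-check on that alignment, the bytes of the callq line can be decoded by hand. On x86-64, opcode e8 is a near call whose operand is a 32-bit little-endian displacement relative to the address of the next instruction. The helper below is only an illustrative sketch (the name call_rel32_target is made up here, not part of any tool), but it shows that the bytes e8 7c c9 17 00 at c797f do resolve to 244300, the address objdump labels LWLockAcquire:

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the target of a 5-byte "call rel32"
 * instruction (opcode 0xE8) located at insn_addr.
 * Returns 0 if the first byte is not 0xE8.
 */
static uint64_t call_rel32_target(uint64_t insn_addr, const uint8_t insn[5])
{
    if (insn[0] != 0xE8)
        return 0;

    /* The displacement is stored little-endian in bytes 1..4. */
    uint32_t raw = (uint32_t) insn[1]
                 | ((uint32_t) insn[2] << 8)
                 | ((uint32_t) insn[3] << 16)
                 | ((uint32_t) insn[4] << 24);

    /* It is signed and relative to the end of the instruction. */
    int32_t disp = (int32_t) raw;

    return insn_addr + 5 + (uint64_t) (int64_t) disp;
}
```

Here the displacement is 0x0017c97c and the next instruction starts at 0xc7984, so the target is 0xc7984 + 0x17c97c = 0x244300.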

>> If there's anything to note about the workload, I'd say that it does
>> tend to make fairly pervasive use of long running transactions which
>> can span probably more than one checkpoint, and the txid reporting
>> functions, and a concurrency level of about 300 or so backends ... but
>> per my reading of the mechanism so far, it doesn't seem like any of
>> this should matter.
>
> Thanks for the details; I agree none of that sounds suspicious.
>
> After some further pondering and testing, this remains a mystery to me. These
> symptoms imply a proper update of ControlFile->checkPointCopy.nextXid without
> having properly updated ControlFile->checkPointCopy.nextXidEpoch. After
> recovery, only CreateCheckPoint() updates ControlFile->checkPointCopy at all.
> Its logic for doing so looks simple and correct.

Yeah. I'm pretty flabbergasted that so much seems to be going right
while this goes wrong.

--
fdr

In response to

Re: txid failed epoch increment, again, aka 6291 at 2012-09-06 10:04:06 from Noah Misch

Responses

Re: txid failed epoch increment, again, aka 6291 at 2012-09-07 12:49:17 from Noah Misch
