txid failed epoch increment, again, aka 6291

From: Daniel Farina <daniel(at)heroku(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: txid failed epoch increment, again, aka 6291
Date: 2012-09-04 09:07:30
Message-ID: CAAZKuFbnPYje8Xax2mMpCVfC4Mh8CZ3aT2Hd9z3W5GOLkce+_Q@mail.gmail.com
Lists: pgsql-hackers

This seems to have reproduced once more. Once again, there doesn't
appear to be any funny business in pg_control (the structs are pasted
below for your re-check), and there are successful, sensible updates
to it. The primary is running 9.0.6.

However, we do have a new piece of data: there was a very brief period
where txid_snapshot did report an xmin greater than 2^33, our next
epoch boundary, by a few thousand transactions. That could be because
the reporting function GetNextXidAndEpoch does its own epoch
calculation before the checkpoint. After a checkpoint that forgets to
increment the epoch, the checkpointed xid is small again, so there is
no longer any reason for it to adjust the epoch post-facto.
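
To make the suspected sequence concrete, here is a minimal standalone
sketch of the reader-side adjustment, assuming GetNextXidAndEpoch
bumps the epoch whenever the current nextXid is numerically below the
last checkpointed xid. The type CkptXidState and the function
model_current_txid are hypothetical names for illustration, not
PostgreSQL symbols.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two shared-memory fields involved
 * (ckptXidEpoch and ckptXid in the dump below). */
typedef struct
{
    uint32_t ckptXidEpoch;  /* epoch recorded by the last checkpoint */
    uint32_t ckptXid;       /* nextXid recorded by the last checkpoint */
} CkptXidState;

/* Model of the reader-side calculation: nextXid is logically later
 * than ckptXid, so if it is numerically smaller it is assumed to have
 * wrapped into the next epoch.  Returns the 64-bit txid. */
static uint64_t
model_current_txid(CkptXidState ckpt, uint32_t nextXid)
{
    uint32_t epoch = ckpt.ckptXidEpoch;

    if (nextXid < ckpt.ckptXid)
        epoch++;

    return ((uint64_t) epoch << 32) | nextXid;
}
```

With an assumed pre-wraparound checkpoint state of {1, 4294000000}
and nextXid = 5000, this model yields a value just above 2^33. With
the state from the dump below, {1, 9904084}, and a slightly later
nextXid, no adjustment happens and the result falls back under 2^33,
which would match a brief period of correct-looking output followed by
a stale epoch.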

I've been reviewing the mechanism in CreateCheckPoint for this on and
off for a couple of days, but so far I haven't come up with a
convincing explanation for the missed increment. However, given that
historically this bug seems more likely to surface than *not* surface
on this system, perhaps we can try for a sometimes-reproducing test
case. I'm still struggling for a hint of a solution, though, so toss
your thoughts here.
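
For reference while reading the dumps, the checkpoint-side rule can be
modeled as follows, assuming the checkpoint carries the previous
epoch forward and increments it only when the new nextXid is
numerically below the previously checkpointed nextXid. CkptModel and
model_checkpoint are hypothetical names, and the second scenario
described afterwards is an illustration of how an increment could be
skipped, not a diagnosis of this bug.

```c
#include <stdint.h>

/* Hypothetical model of the two checkpoint fields of interest. */
typedef struct
{
    uint32_t nextXidEpoch;
    uint32_t nextXid;
} CkptModel;

/* Build the next checkpoint's epoch/xid pair from the previous one.
 * The epoch is bumped only if nextXid appears to have wrapped
 * relative to the previous checkpoint's nextXid. */
static CkptModel
model_checkpoint(CkptModel prev, uint32_t nextXidNow)
{
    CkptModel cur;

    cur.nextXid = nextXidNow;
    cur.nextXidEpoch = prev.nextXidEpoch;
    if (nextXidNow < prev.nextXid)
        cur.nextXidEpoch++;

    return cur;
}
```

Starting from an assumed previous checkpoint of {1, 4294000000}, a new
nextXid of 9904084 gives epoch 2. If the comparison baseline somehow
already held a post-wraparound xid with the old epoch, say {1, 5000},
the same call would leave the epoch at 1, which is the shape of the
values seen in the structs below.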

$2 = {Insert = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0},
Flush = {xlogid = 0, xrecoff = 0}}, PrevRecord = {xlogid = 0,
xrecoff = 0}, curridx = 0, currpage = 0x7ff4ed04a000, currpos =
0x0, RedoRecPtr = {xlogid = 18751, xrecoff = 1200832888},
forcePageWrites = 0 '\000'}, LogwrtRqst = {Write = {xlogid = 0,
xrecoff = 0}, Flush = {xlogid = 0, xrecoff = 0}}, LogwrtResult = {
Write = {xlogid = 0, xrecoff = 0}, Flush = {xlogid = 0, xrecoff =
0}}, ckptXidEpoch = 1, ckptXid = 9904084, asyncXactLSN = {
xlogid = 0, xrecoff = 0}, lastRemovedLog = 0, lastRemovedSeg = 0,
Write = {LogwrtResult = {Write = {xlogid = 0, xrecoff = 0},
Flush = {xlogid = 0, xrecoff = 0}}, curridx = 0,
lastSegSwitchTime = 0}, pages = 0x7ff4ed04a000 "", xlblocks =
0x7ff4ed0471d8,
XLogCacheBlck = 1023, ThisTimeLineID = 0, RecoveryTargetTLI = 6,
archiveCleanupCommand = '\000' <repeats 1023 times>,
SharedRecoveryInProgress = 1 '\001', lastCheckPointRecPtr = {xlogid
= 18751, xrecoff = 1671519088}, lastCheckPoint = {redo = {
xlogid = 18751, xrecoff = 1200832888}, ThisTimeLineID = 6,
nextXidEpoch = 1, nextXid = 9904084, nextOid = 2047524,
nextMulti = 1119, nextMultiOffset = 3115, oldestXid = 4115479553,
oldestXidDB = 1, time = 1346746796, oldestActiveXid = 9776547},
replayEndRecPtr = {xlogid = 18751, xrecoff = 1748623656},
recoveryLastRecPtr = {xlogid = 18751, xrecoff = 1748623656},
recoveryLastXTime = 400062234671833, info_lck = 0 '\000'}
(gdb) p ControlFile
$3 = (ControlFileData *) 0x7ff4ed046bf8
(gdb) p *ControlFile
$4 = {system_identifier = 5613733157253676693, pg_control_version =
903, catalog_version_no = 201008051,
state = DB_IN_ARCHIVE_RECOVERY, time = 1346746898, checkPoint =
{xlogid = 18751, xrecoff = 1072693824}, prevCheckPoint = {
xlogid = 18751, xrecoff = 1072693824}, checkPointCopy = {redo =
{xlogid = 18751, xrecoff = 602482536}, ThisTimeLineID = 6,
nextXidEpoch = 1, nextXid = 9904084, nextOid = 2047524, nextMulti
= 1119, nextMultiOffset = 3115, oldestXid = 4115479553,
oldestXidDB = 1, time = 1346746496, oldestActiveXid = 9558248},
minRecoveryPoint = {xlogid = 18751, xrecoff = 1748623656},
backupStartPoint = {xlogid = 0, xrecoff = 0}, wal_level = 2,
MaxConnections = 500, max_prepared_xacts = 500, max_locks_per_xact =
64,
maxAlign = 8, floatFormat = 1234567, blcksz = 8192, relseg_size =
131072, xlog_blcksz = 8192, xlog_seg_size = 16777216,
nameDataLen = 64, indexMaxKeys = 32, toast_max_chunk_size = 1996,
enableIntTimes = 1 '\001', float4ByVal = 1 '\001',
float8ByVal = 1 '\001', crc = 3725972657}

--
fdr
