Race condition in HEAD, possibly due to PGPROC splitup

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Race condition in HEAD, possibly due to PGPROC splitup
Date:	2011-12-14 04:15:30
Message-ID:	27187.1323836130@sss.pgh.pa.us
Lists:	pgsql-hackers

If you add this Assert to lock.c:

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 3ba4671..d9c15e0 100644
*** a/src/backend/storage/lmgr/lock.c
--- b/src/backend/storage/lmgr/lock.c
*************** GetRunningTransactionLocks(int *nlocks)
*** 3195,3200 ****
--- 3195,3202 ----
              accessExclusiveLocks[index].dbOid = lock->tag.locktag_field1;
              accessExclusiveLocks[index].relOid = lock->tag.locktag_field2;
  
+             Assert(TransactionIdIsNormal(accessExclusiveLocks[index].xid));
+ 
              index++;
          }
      }

then set wal_level = hot_standby, and run the regression tests
repeatedly, the Assert will trigger eventually --- for me, it happens
within a dozen or so parallel iterations, or rather longer if I run
the tests serial style. Stack trace is unsurprising, since AFAIK this
is only called in the checkpointer:

#2  0x000000000073461d in ExceptionalCondition (
    conditionName=<value optimized out>, errorType=<value optimized out>,
    fileName=<value optimized out>, lineNumber=<value optimized out>)
    at assert.c:57
#3  0x000000000065eca1 in GetRunningTransactionLocks (nlocks=0x7fffa997de8c)
    at lock.c:3198
#4  0x00000000006582b8 in LogStandbySnapshot (nextXid=0x7fffa997dee0)
    at standby.c:835
#5  0x00000000004b0b97 in CreateCheckPoint (flags=32) at xlog.c:7761
#6  0x000000000062bf92 in CheckpointerMain () at checkpointer.c:488
#7  0x00000000004cf465 in AuxiliaryProcessMain (argc=2, argv=0x7fffa997e110)
    at bootstrap.c:424
#8  0x00000000006261f5 in StartChildProcess (type=CheckpointerProcess)
    at postmaster.c:4487

The actual value of the bogus xid (which was pulled from
allPgXact[proc->pgprocno]->xid just above here) is zero. What I believe
is happening is that somebody is clearing his pgxact->xid entry
asynchronously to GetRunningTransactionLocks, and since that clearly
oughta be impossible, something is broken.
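
To make the suspected interleaving concrete, here is a toy, single-threaded model of it. This is only a sketch and not PostgreSQL code: the Toy* types, toy_collect_locks, the hook mechanism and the OIDs are all made up for illustration. The only things borrowed from the real tree are the conventions that xid 0 is invalid and that "normal" xids start at 3. The hook stands in for another backend that clears its xid entry while the scan is partway through the lock table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId     ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)

/* Toy stand-ins (hypothetical), not the real PGXACT/LOCK structs. */
typedef struct { TransactionId xid; } ToyXact;            /* per-backend xid, in its own array */
typedef struct { int procno; uint32_t relOid; } ToyLock;  /* lock entry naming its holder */
typedef struct { TransactionId xid; uint32_t relOid; } ToyLockReport;

/* Invoked before each entry is read, to model a concurrently running backend. */
typedef void (*InterleaveHook)(ToyXact *xacts, size_t next_index);

static int
toy_xid_is_normal(TransactionId xid)
{
    return xid >= FirstNormalTransactionId;
}

/* Walk the lock entries and copy each holder's xid from the separate array. */
static size_t
toy_collect_locks(const ToyLock *locks, size_t nlocks, ToyXact *xacts,
                  ToyLockReport *out, InterleaveHook hook)
{
    size_t n = 0;

    for (size_t i = 0; i < nlocks; i++)
    {
        if (hook)
            hook(xacts, i);     /* the "other backend" gets to run here */
        out[n].xid = xacts[locks[i].procno].xid;
        out[n].relOid = locks[i].relOid;
        n++;
    }
    return n;
}

/* Backend 1 clears its xid after the scan has started but before the
 * scan reaches backend 1's lock entry. */
static void
toy_clear_xid_of_proc1(ToyXact *xacts, size_t next_index)
{
    if (next_index == 1)
        xacts[1].xid = InvalidTransactionId;
}

/* Run the scenario and return the xid reported for backend 1's lock. */
static TransactionId
toy_demo_bogus_xid(void)
{
    ToyXact       xacts[2] = {{100}, {101}};
    ToyLock       locks[2] = {{0, 16384}, {1, 16385}};
    ToyLockReport out[2];
    size_t        n = toy_collect_locks(locks, 2, xacts, out,
                                        toy_clear_xid_of_proc1);

    assert(n == 2);
    assert(toy_xid_is_normal(out[0].xid));  /* backend 0 is unaffected */
    return out[1].xid;
}
```

In this model toy_demo_bogus_xid() returns 0: the lock entry for backend 1 is still in the table, but the xid read for it has already been cleared, which is the shape of the failure the added Assert catches. The sketch says nothing about which code path in the real tree does the clearing.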

Without the added assert, you'd only notice this if you were running a
standby slave --- the zero xid results in an assert failure in WAL
replay on the slave end, which is how I found out about this to start
with. But since we've not heard reports of such before, I suspect that
this is a recently introduced bug; and personally I'd bet money that it
was the PGXACT patch that broke it.

I have other things to do than look into this right now myself.

regards, tom lane

Responses

Re: Race condition in HEAD, possibly due to PGPROC splitup at 2011-12-14 12:20:37 from Pavan Deolasee
