Deadlock condition in current sources

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Deadlock condition in current sources
Date: 2001-12-18 03:29:01
Message-ID: 9598.1008646141@sss.pgh.pa.us
Lists: pgsql-hackers

I have observed a nasty three-way deadlock condition.

The first proc is trying to generate a new transaction ID, and has hit the one
case in every 32K transaction IDs where a new page must be added to the CLOG. That
means that an XLOG record must be written to record the creation of the
new CLOG page:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tgl 1135 0.0 3.4 41012 8812 pts/2 SN 19:54 0:00 postgres: tgl bench [local] idle

#0 0x401d63b2 in semop (semid=1474560, sops=0xbfffcd20, nsops=1) at ../sysdeps/unix/sysv/linux/semop.c:36
#1 0x0811ccab in IpcSemaphoreLock (semId=1474560, sem=4, interruptOK=0 '\000') at ipc.c:422
#2 0x0812332f in LWLockAcquire (lockid=WALInsertLock, mode=LW_EXCLUSIVE) at lwlock.c:271
#3 0x08091d90 in XLogInsert (rmid=3 '\003', info=0 '\000', rdata=0xbfffef10) at xlog.c:644
#4 0x08090237 in WriteZeroPageXlogRec (pageno=2) at clog.c:962
#5 0x0808f7e0 in ZeroCLOGPage (pageno=2, writeXlog=1 '\001') at clog.c:357
This proc is holding CLogControlLock, LW_EXCLUSIVE:
#6 0x0808ff50 in ExtendCLOG (newestXact=65536) at clog.c:778
This proc is holding XidGenLock, LW_EXCLUSIVE:
#7 0x08090590 in GetNewTransactionId () at varsup.c:58
#8 0x08090d77 in StartTransaction () at xact.c:863
#9 0x080910f9 in StartTransactionCommand () at xact.c:1156
#10 0x08126753 in pg_exec_query_string (query_string=0x8271410 "begin", dest=Remote, parse_context=0x8247adc) at postgres.c:603
#11 0x081278da in PostgresMain (argc=4, argv=0xbffff1c0, username=0x822dce9 "tgl") at postgres.c:1849

The first proc is waiting for the second, which already holds WALInsertLock.
The second proc is trying to make the first XLOG entry of his transaction.
Therefore he needs to set MyProc->logRec, which presently requires him
to obtain SInvalLock:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tgl 1196 0.0 3.5 41028 8928 pts/2 SN 19:54 0:00 postgres: tgl bench [local] UPDATE

#0 0x401d63b2 in semop (semid=1572867, sops=0xbfffcb50, nsops=1) at ../sysdeps/unix/sysv/linux/semop.c:36
#1 0x0811ccab in IpcSemaphoreLock (semId=1572867, sem=15, interruptOK=0 '\000') at ipc.c:422
#2 0x0812332f in LWLockAcquire (lockid=SInvalLock, mode=LW_EXCLUSIVE) at lwlock.c:271
This proc is holding WALInsertLock, LW_EXCLUSIVE:
#3 0x0809222f in XLogInsert (rmid=10 '\n', info=40 '(', rdata=0xbfffed50) at xlog.c:747
#4 0x08079f4e in log_heap_update (reln=0x425c7fe0, oldbuf=238, from={ip_blkid = {bi_hi = 1, bi_lo = 5639}, ip_posid = 16},
newbuf=3307, newtup=0x82865e8, move=0 '\000') at heapam.c:1931
#5 0x0807948f in heap_update (relation=0x425c7fe0, otid=0xbfffef10, newtup=0x82865e8, ctid=0xbfffee80) at heapam.c:1565
#6 0x080d6216 in ExecReplace (slot=0x827a9ec, tupleid=0xbfffef10, estate=0x827ae38) at execMain.c:1454
#7 0x080d5f1d in ExecutePlan (estate=0x827ae38, plan=0x827ad90, operation=CMD_UPDATE, numberTuples=0,
direction=ForwardScanDirection, destfunc=0x827b6e4) at execMain.c:1129
#8 0x080d5260 in ExecutorRun (queryDesc=0x827ae1c, estate=0x827ae38, feature=3, count=0) at execMain.c:233
#9 0x08127e13 in ProcessQuery (parsetree=0x8272148, plan=0x827ad90, dest=Remote) at pquery.c:293
#10 0x08126942 in pg_exec_query_string (
query_string=0x8271d90 "update accounts set abalance = abalance + 735 where aid = 4270516\n", dest=Remote,
parse_context=0x824845c) at postgres.c:781
#11 0x081278da in PostgresMain (argc=4, argv=0xbffff1c0, username=0x822dce9 "tgl") at postgres.c:1849

And the third proc is trying to obtain XidGenLock while already holding
SInvalLock:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tgl 1138 0.0 3.5 41020 8936 pts/2 SN 19:54 0:00 postgres: tgl bench [local] idle in transaction

#0 0x401d63b2 in semop (semid=1474560, sops=0xbfffef00, nsops=1) at ../sysdeps/unix/sysv/linux/semop.c:36
#1 0x0811ccab in IpcSemaphoreLock (semId=1474560, sem=7, interruptOK=0 '\000') at ipc.c:422
#2 0x0812332f in LWLockAcquire (lockid=XidGenLock, mode=LW_SHARED) at lwlock.c:271
#3 0x080905d4 in ReadNewTransactionId () at varsup.c:103
This proc is holding SInvalLock, LW_SHARED:
#4 0x0811e2ae in GetSnapshotData (serializable=0 '\000') at sinval.c:359
#5 0x081767ce in SetQuerySnapshot () at tqual.c:752
#6 0x081268f9 in pg_exec_query_string (
query_string=0x8271458 "insert into history(tid,bid,aid,delta,mtime) values(336,81,9860149,356,'now')", dest=Remote,
parse_context=0x8247b0c) at postgres.c:764
#7 0x081278da in PostgresMain (argc=4, argv=0xbffff1c0, username=0x822dce9 "tgl") at postgres.c:1849

Unfortunately the first proc is holding XidGenLock, ergo deadlock.
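
As a sanity check on the cycle described above, here is a minimal Python sketch (not PostgreSQL code). It records which locks each of the three backends holds and which one it is waiting for, as read off the backtraces, builds a wait-for graph, and follows it from PID 1135. The table layout and the helper names `build_wait_for` and `find_cycle` are hypothetical, and the sketch assumes one holder per lock, which ignores that SInvalLock is held in LW_SHARED mode.

```python
# Illustrative only: lock names and PIDs come from the backtraces above;
# the data structures and helper names are hypothetical.

# pid -> (locks held, lock being waited for)
BACKENDS = {
    1135: ({"XidGenLock", "CLogControlLock"}, "WALInsertLock"),
    1196: ({"WALInsertLock"}, "SInvalLock"),
    1138: ({"SInvalLock"}, "XidGenLock"),
}


def build_wait_for(backends):
    """Map each waiting pid to the pid holding the lock it wants."""
    holder = {}
    for pid, (held, _wanted) in backends.items():
        for lock in held:
            holder[lock] = pid  # simplification: one holder per lock
    return {
        pid: holder[wanted]
        for pid, (_held, wanted) in backends.items()
        if wanted in holder
    }


def find_cycle(wait_for, start):
    """Follow wait-for edges from start; return the cycle or None."""
    path = [start]
    current = start
    while current in wait_for:
        current = wait_for[current]
        if current in path:
            return path[path.index(current):] + [current]
        path.append(current)
    return None


wait_for = build_wait_for(BACKENDS)
cycle = find_cycle(wait_for, 1135)
print(" -> ".join(str(pid) for pid in cycle))
```

Run as written, this prints `1135 -> 1196 -> 1138 -> 1135`: each backend waits on a lock held by the next, and the chain closes on itself.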

I don't think we have any room to wiggle in terms of the locking
sequence of the first proc (see comments in GetNewTransactionId),
nor of the third (see comments in GetSnapshotData). That means
the only way to resolve the deadlock is to not grab SInvalLock
while holding the WALInsertLock in XLogInsert.

I believe this is actually safe, because the only code that looks at the
logRec fields of other backends' PROC structures is GetUndoRecPtr,
which is only called while holding WALInsertLock in CreateCheckPoint.
Therefore, we could re-document proc->logRec as being protected by
WALInsertLock not SInvalLock and not have to get SInvalLock in
XLogInsert.

However, there's still a problem: GetUndoRecPtr also gets SInvalLock
while its caller holds WALInsertLock, and therefore this routine
could create the second leg of the deadlock too. Removing the
SInvalLock acquisition there creates the problem that backends might be
added to or deleted from the PROC array while GetUndoRecPtr runs.
I think it might be possible to survive that, by adding an assumption
that logRec.xrecoff can be set to zero atomically, but it seems tricky.
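
The point about GetUndoRecPtr can be illustrated with a lock-order graph. The following is a rough Python sketch under stated assumptions, not PostgreSQL code: each edge means "holds A while acquiring B", taken from the description in this message, and the per-call-site grouping and the helpers `has_cycle` and `without` are hypothetical. The GetUndoRecPtr edge stands for the case where its caller, CreateCheckPoint, holds WALInsertLock.

```python
# Illustrative only: edges are "holds A while acquiring B", read off the
# message; the call-site labels and helper names are hypothetical.

LOCK_ORDER = {
    "GetNewTransactionId": [("XidGenLock", "CLogControlLock"),
                            ("CLogControlLock", "WALInsertLock")],
    "XLogInsert": [("WALInsertLock", "SInvalLock")],
    "GetSnapshotData": [("SInvalLock", "XidGenLock")],
    # WALInsertLock is held by the caller, CreateCheckPoint.
    "GetUndoRecPtr": [("WALInsertLock", "SInvalLock")],
}


def has_cycle(call_sites):
    """Depth-first search for a cycle in the lock-order graph."""
    graph = {}
    for site in call_sites:
        for held, wanted in LOCK_ORDER[site]:
            graph.setdefault(held, set()).add(wanted)

    visiting, done = set(), set()

    def visit(lock):
        if lock in done:
            return False
        if lock in visiting:
            return True  # back edge: the ordering is circular
        visiting.add(lock)
        found = any(visit(nxt) for nxt in graph.get(lock, ()))
        visiting.discard(lock)
        done.add(lock)
        return found

    return any(visit(lock) for lock in list(graph))


def without(*removed):
    """All call sites except the ones whose SInvalLock use is dropped."""
    return [site for site in LOCK_ORDER if site not in removed]


print(has_cycle(without()))                               # current sources
print(has_cycle(without("XLogInsert")))                   # fix XLogInsert only
print(has_cycle(without("XLogInsert", "GetUndoRecPtr")))  # fix both
```

This prints `True`, `True`, `False`: dropping SInvalLock from XLogInsert alone leaves the circular ordering in place through GetUndoRecPtr, and only removing both breaks it. The sketch says nothing about whether GetUndoRecPtr is actually safe without SInvalLock, which is the tricky part raised above.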

Comments? Anyone see a better approach?

regards, tom lane
