Re: LWLock deadlock and gdb advice

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Peter Geoghegan <pg(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: LWLock deadlock and gdb advice
Date: 2015-07-31 16:48:28
Message-ID: CAMkU=1wS8R7rz67jYEZoc4P_rXp0frbZerdP1gcCbE=w8fpMvw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 30, 2015 at 8:22 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2015-07-30 09:03:01 -0700, Jeff Janes wrote:
> > On Wed, Jul 29, 2015 at 6:10 AM, Andres Freund <andres(at)anarazel(dot)de>
> wrote:
> > > What do you think about something roughly like the attached?
> > >
> >
> > I've not evaluated the code, but applying it does solve the problem I was
> > seeing.
>
> Cool, thanks for testing! How long did you run the test and how long did
> it, on average, previously take for the issue to occur?
>

I had run it for 24 hours, while it usually took less than 8 hours to look
up before. I did see it within a few minutes one time when I checked out a
new HEAD and then forgot to re-apply your or Heikki's patch.

But now I've got the same situation again, after 15 hours, with your
patch. This is probably all down to luck. The only differences that I can
think of is that I advanced the base to e8e86fbc8b3619da54c, and turned on
JJ_vac and set log_autovacuum_min_duration=0.

checkpointer, 31516
(gdb) bt
#0 0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6
#1 0x00000000006481df in PGSemaphoreLock (sema=0x7f84ccd3c5f8) at
pg_sema.c:387
#2 0x00000000006a5754 in LWLockAcquireCommon (l=0x7f84c42edf80,
valptr=0x7f84c42edfa8, val=18446744073709551615) at lwlock.c:1041
#3 LWLockAcquireWithVar (l=0x7f84c42edf80, valptr=0x7f84c42edfa8,
val=18446744073709551615) at lwlock.c:915
#4 0x00000000004cbd86 in WALInsertLockAcquireExclusive () at xlog.c:1445
#5 0x00000000004d4195 in CreateCheckPoint (flags=64) at xlog.c:8181
#6 0x000000000064ea53 in CheckpointerMain () at checkpointer.c:493
#7 0x00000000004e328a in AuxiliaryProcessMain (argc=2,
argv=0x7fff6eed1f50) at bootstrap.c:428
#8 0x00000000006563f7 in StartChildProcess (type=CheckpointerProcess) at
postmaster.c:5167
#9 0x000000000065884a in reaper (postgres_signal_arg=<value optimized
out>) at postmaster.c:2670

(gdb) f 2
(gdb) p *(lock->owner)
$1 = {links = {prev = 0x0, next = 0x0}, procgloballist = 0x7f84ccd27c88,
sem = {semId = 1480359975, semNum = 12}, waitStatus = 0, procLatch =
{is_set = 1,
is_shared = 1 '\001', owner_pid = 31540}, lxid = 3209, pid = 31540,
pgprocno = 92, backendId = 12, databaseId = 16384, roleId = 10,
recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 2
'\002', lwWaitLink = {prev = 0x7f84c42edf10, next = 0x7f84ccd38e60},
waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0,
waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0, next = 0x0},
myProcLocks = {{prev = 0x7f84ccd38bb8, next = 0x7f84ccd38bb8}, {prev =
0x7f84ccd38bc8, next = 0x7f84ccd38bc8}, {prev = 0x7f84ccd38bd8,
next = 0x7f84ccd38bd8}, {prev = 0x7f84ccd38be8, next =
0x7f84ccd38be8}, {prev = 0x7f84ccd38bf8, next = 0x7f84ccd38bf8}, {prev =
0x7f84ccd38c08,
next = 0x7f84ccd38c08}, {prev = 0x7f84ccd38c18, next =
0x7f84ccd38c18}, {prev = 0x7f84ccd38c28, next = 0x7f84ccd38c28}, {prev =
0x7f84ccd38c38,
next = 0x7f84ccd38c38}, {prev = 0x7f84ccd38c48, next =
0x7f84ccd38c48}, {prev = 0x7f84ccd38c58, next = 0x7f84ccd38c58}, {prev =
0x7f84ccd38c68,
next = 0x7f84ccd38c68}, {prev = 0x7f84ccd38c78, next =
0x7f84ccd38c78}, {prev = 0x7f84ccd38c88, next = 0x7f84ccd38c88}, {prev =
0x7f84ccb0ceb0,
next = 0x7f84ccb0ceb0}, {prev = 0x7f84ccd38ca8, next =
0x7f84ccd38ca8}}, subxids = {xids = {0 <repeats 64 times>}}, backendLock =
0x7f84c42e9800,
fpLockBits = 160528697655296, fpRelId = {0, 0, 0, 0, 0, 0, 0, 0, 0, 2679,
2610, 16413, 16411, 16415, 16414, 16408}, fpVXIDLock = 1 '\001',
fpLocalTransactionId = 3209}

backend 31540:

#0 0x0000003dcb6eaf27 in semop () from /lib64/libc.so.6
#1 0x00000000006481df in PGSemaphoreLock (sema=0x7f84ccd38b38) at
pg_sema.c:387
#2 0x00000000006a53d4 in LWLockWaitForVar (lock=0x7f84c42edf00,
valptr=0x7f84c42edf28, oldval=0, newval=0x7fff6eeceaf8) at lwlock.c:1438
#3 0x00000000004d3276 in WaitXLogInsertionsToFinish (upto=183341449216) at
xlog.c:1576
#4 0x00000000004d3b33 in AdvanceXLInsertBuffer (upto=183345636352,
opportunistic=<value optimized out>) at xlog.c:1888
#5 0x00000000004d8df2 in GetXLogBuffer (ptr=183345636352) at xlog.c:1669
#6 0x00000000004d8fea in CopyXLogRecordToWAL (rdata=0xbd2410,
fpw_lsn=<value optimized out>) at xlog.c:1276
#7 XLogInsertRecord (rdata=0xbd2410, fpw_lsn=<value optimized out>) at
xlog.c:1008
#8 0x00000000004dc72c in XLogInsert (rmid=13 '\r', info=32 ' ') at
xloginsert.c:453
#9 0x0000000000478f41 in ginPlaceToPage (btree=0x7fff6eecee20,
stack=0x1be11e0, insertdata=<value optimized out>, updateblkno=<value
optimized out>,
childbuf=0, buildStats=0x0) at ginbtree.c:418
#10 0x00000000004798a3 in ginInsertValue (btree=0x7fff6eecee20,
stack=0x1be11e0, insertdata=0x7fff6eeceea0, buildStats=0x0) at
ginbtree.c:748
#11 0x00000000004734aa in ginEntryInsert (ginstate=0x7fff6eecf200,
attnum=39408, key=1, category=<value optimized out>, items=0x1c81c20,
nitem=54,
buildStats=0x0) at gininsert.c:234
#12 0x000000000047edbb in ginInsertCleanup (ginstate=0x7fff6eecf200,
vac_delay=<value optimized out>, stats=0x0) at ginfast.c:843
#13 0x000000000047fae6 in ginHeapTupleFastInsert (ginstate=0x7fff6eecf200,
collector=<value optimized out>) at ginfast.c:436
#14 0x000000000047386a in gininsert (fcinfo=<value optimized out>) at
gininsert.c:531
...

(gdb) f 2
(gdb) p *(lock->owner)
$1 = {links = {prev = 0x0, next = 0x0}, procgloballist = 0x0, sem = {semId
= 1480425513, semNum = 0}, waitStatus = 0, procLatch = {is_set = 0,
is_shared = 1 '\001', owner_pid = 31516}, lxid = 0, pid = 31516,
pgprocno = 112, backendId = -1, databaseId = 0, roleId = 0,
recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 0
'\000', lwWaitLink = {prev = 0x7f84ccd3cc10, next = 0x7f84ccd3a5e0},
waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0,
waitLSN = 0, syncRepState = 0, syncRepLinks = {prev = 0x0, next = 0x0},
myProcLocks = {{prev = 0x7f84ccd3c678, next = 0x7f84ccd3c678}, {prev =
0x7f84ccd3c688, next = 0x7f84ccd3c688}, {prev = 0x7f84ccd3c698,
next = 0x7f84ccd3c698}, {prev = 0x7f84ccd3c6a8, next =
0x7f84ccd3c6a8}, {prev = 0x7f84ccd3c6b8, next = 0x7f84ccd3c6b8}, {prev =
0x7f84ccd3c6c8,
next = 0x7f84ccd3c6c8}, {prev = 0x7f84ccd3c6d8, next =
0x7f84ccd3c6d8}, {prev = 0x7f84ccd3c6e8, next = 0x7f84ccd3c6e8}, {prev =
0x7f84ccd3c6f8,
next = 0x7f84ccd3c6f8}, {prev = 0x7f84ccd3c708, next =
0x7f84ccd3c708}, {prev = 0x7f84ccd3c718, next = 0x7f84ccd3c718}, {prev =
0x7f84ccd3c728,
next = 0x7f84ccd3c728}, {prev = 0x7f84ccd3c738, next =
0x7f84ccd3c738}, {prev = 0x7f84ccd3c748, next = 0x7f84ccd3c748}, {prev =
0x7f84ccd3c758,
next = 0x7f84ccd3c758}, {prev = 0x7f84ccd3c768, next =
0x7f84ccd3c768}}, subxids = {xids = {0 <repeats 64 times>}}, backendLock =
0x7f84c42e9d00,
fpLockBits = 0, fpRelId = {0 <repeats 16 times>}, fpVXIDLock = 0 '\000',
fpLocalTransactionId = 0}

Cheers,

Jeff

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2015-07-31 17:31:54 Re: Re: Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"
Previous Message Jesper Pedersen 2015-07-31 15:56:20 RequestAddinLWLocks(int n)