pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue
Date: 2023-07-11 02:01:37
Message-ID: ZKy4AdrLEfbqrxGJ@telsasoft.com
Lists: pgsql-hackers

An instance compiled locally, without assertions, failed like this:

< 2023-07-09 22:04:46.470 UTC >LOG: process 30002 detected deadlock while waiting for ShareLock on transaction 813219577 after 333.228 ms
< 2023-07-09 22:04:46.470 UTC >DETAIL: Process holding the lock: 2103. Wait queue: 30001.
< 2023-07-09 22:04:46.470 UTC >CONTEXT: while checking uniqueness of tuple (549,4) in relation "pg_statistic_ext_data"
< 2023-07-09 22:04:46.470 UTC >STATEMENT: REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.474 UTC >ERROR: deadlock detected
< 2023-07-09 22:04:46.474 UTC >DETAIL: Process 30001 waits for ShareLock on transaction 813219577; blocked by process 2103.
Process 2103 waits for RowExclusiveLock on relation 3429 of database 16588; blocked by process 30001.
Process 30001: REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
Process 2103: autovacuum: ANALYZE child.ericsson_sgsn_rac_202307
< 2023-07-09 22:04:46.474 UTC >HINT: See server log for query details.
< 2023-07-09 22:04:46.474 UTC >CONTEXT: while checking uniqueness of tuple (549,4) in relation "pg_statistic_ext_data"
< 2023-07-09 22:04:46.474 UTC >STATEMENT: REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.483 UTC >LOG: background worker "parallel worker" (PID 30002) exited with exit code 1
< 2023-07-09 22:04:46.487 UTC postgres >ERROR: deadlock detected
< 2023-07-09 22:04:46.487 UTC postgres >DETAIL: Process 30001 waits for ShareLock on transaction 813219577; blocked by process 2103.
Process 2103 waits for RowExclusiveLock on relation 3429 of database 16588; blocked by process 30001.
< 2023-07-09 22:04:46.487 UTC postgres >HINT: See server log for query details.
< 2023-07-09 22:04:46.487 UTC postgres >CONTEXT: while checking uniqueness of tuple (549,4) in relation "pg_statistic_ext_data"
parallel worker
< 2023-07-09 22:04:46.487 UTC postgres >STATEMENT: REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.848 UTC >LOG: server process (PID 30001) was terminated by signal 11: Segmentation fault
< 2023-07-09 22:04:46.848 UTC >DETAIL: Failed process was running: REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index

=> REINDEX was running with parallel workers, deadlocked with ANALYZE, and
then crashed while cleaning up after the deadlock error.

It looks like parallel workers are needed to hit this issue.
I doubt it's specific to extended statistics, but I haven't confirmed that.

I reproduced the crash with a manual REINDEX plus ANALYZE, with assertions
enabled (none were hit), and on a more recent commit (1124cb2cf). The crash
is hit about 30% of the time when running REINDEX in a loop while also
running ANALYZE.

I hope someone has a hunch where to look; so far I haven't been able to
create a minimal reproducer.
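
For what it's worth, the gist of the loop is below, written out with libpq
just for concreteness. It is not self-contained: it assumes a database that
already has a table with extended statistics (the table name below is a
placeholder) and enough churn in pg_statistic_ext_data for REINDEX and
ANALYZE to collide. Deadlock errors are expected and harmless here; the
interesting outcome is the lost connection when a backend segfaults.

/* compile with: cc repro.c -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <libpq-fe.h>

static void
loop(const char *conninfo, const char *sql)
{
	PGconn	   *conn = PQconnectdb(conninfo);

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		exit(1);
	}

	for (;;)
	{
		PGresult   *res = PQexec(conn, sql);

		/* deadlock errors are expected; just report them and keep going */
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			fprintf(stderr, "%s: %s", sql, PQerrorMessage(conn));
		PQclear(res);

		/* a lost connection here likely means the backend crashed */
		if (PQstatus(conn) == CONNECTION_BAD)
		{
			fprintf(stderr, "connection lost -- check the server log\n");
			exit(1);
		}
	}
}

int
main(void)
{
	const char *conninfo = "dbname=ts";	/* adjust as needed */

	/* child loops REINDEX on the catalog index; parent loops ANALYZE */
	if (fork() == 0)
		loop(conninfo, "REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index");

	loop(conninfo, "ANALYZE some_table_with_ext_stats");	/* placeholder table */
	return 0;
}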

Core was generated by `postgres: pryzbyj ts [local] REINDEX '.
Program terminated with signal 11, Segmentation fault.
#0 RemoveFromWaitQueue (proc=0x2aaabc1289e0, hashcode=2627626119) at ../src/backend/storage/lmgr/lock.c:1898
1898 LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
(gdb) bt
#0 RemoveFromWaitQueue (proc=0x2aaabc1289e0, hashcode=2627626119) at ../src/backend/storage/lmgr/lock.c:1898
#1 0x00000000007ab56b in LockErrorCleanup () at ../src/backend/storage/lmgr/proc.c:735
#2 0x0000000000548a7e in AbortTransaction () at ../src/backend/access/transam/xact.c:2735
#3 0x0000000000549405 in AbortCurrentTransaction () at ../src/backend/access/transam/xact.c:3414
#4 0x00000000007b6414 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../src/backend/tcop/postgres.c:4352
#5 0x0000000000730e9a in BackendRun (port=<optimized out>, port=<optimized out>) at ../src/backend/postmaster/postmaster.c:4461
#6 BackendStartup (port=0x12a8a50) at ../src/backend/postmaster/postmaster.c:4189
#7 ServerLoop () at ../src/backend/postmaster/postmaster.c:1779
#8 0x000000000073207d in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x127a230) at ../src/backend/postmaster/postmaster.c:1463
#9 0x00000000004b5535 in main (argc=3, argv=0x127a230) at ../src/backend/main/main.c:198

(gdb) l
1893 RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
1894 {
1895 LOCK *waitLock = proc->waitLock;
1896 PROCLOCK *proclock = proc->waitProcLock;
1897 LOCKMODE lockmode = proc->waitLockMode;
1898 LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
1899
1900 /* Make sure proc is waiting */
1901 Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
1902 Assert(proc->links.next != NULL);

(gdb) p waitLock
$1 = (LOCK *) 0x0
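
For context, the caller one frame up is LockErrorCleanup() (proc.c:735 in the
backtrace). From memory, the relevant part of it looks roughly like the sketch
below (paraphrased, not a verbatim quote), i.e. RemoveFromWaitQueue() is only
reached when our wait-queue links still look attached:

	/* LockErrorCleanup(), approximately -- paraphrased from memory */
	if (lockAwaited == NULL)
		return;					/* we weren't waiting for a lock */

	partitionLock = LockHashPartitionLock(lockAwaited->hashcode);
	LWLockAcquire(partitionLock, LW_EXCLUSIVE);

	if (!dlist_node_is_detached(&MyProc->links))
	{
		/* still on the lock's wait queue (we think), so dequeue ourselves */
		RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
	}
	else if (MyProc->waitStatus == PROC_WAIT_STATUS_OK)
	{
		/* somebody already dequeued us and granted the lock */
		GrantAwaitedLock();
	}

	lockAwaited = NULL;
	LWLockRelease(partitionLock);

If that reading is right, we took the first branch with proc->waitLock already
NULL, i.e. something had already dequeued this PGPROC and cleared its waitLock
while its links did not look detached. That's only a guess, though.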

Another variant on this crash:

Jul 11 00:55:19 telsa kernel: postgres[25415]: segfault at f ip 000000000081111a sp 00007ffdbc01ea90 error 4 in postgres[400000+8df000]

Core was generated by `postgres: parallel worker for PID 27096 waiting '.

(gdb) bt
#0 RemoveFromWaitQueue (proc=0x2aaabc154040, hashcode=2029421528) at ../src/backend/storage/lmgr/lock.c:1874
#1 0x000000000081de2f in LockErrorCleanup () at ../src/backend/storage/lmgr/proc.c:735
#2 0x0000000000826990 in ProcessInterrupts () at ../src/backend/tcop/postgres.c:3207
#3 0x000000000081e355 in ProcSleep (locallock=locallock@entry=0x252a9d0, lockMethodTable=lockMethodTable@entry=0xee1260 <default_lockmethod>) at ../src/backend/storage/lmgr/proc.c:1295
#4 0x000000000080eff1 in WaitOnLock (locallock=locallock@entry=0x252a9d0, owner=owner@entry=0x253b548) at ../src/backend/storage/lmgr/lock.c:1818
#5 0x00000000008107ce in LockAcquireExtended (locktag=locktag@entry=0x7ffdbc01ee10, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false,
reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x0) at ../src/backend/storage/lmgr/lock.c:1082
#6 0x00000000008110a4 in LockAcquire (locktag=locktag@entry=0x7ffdbc01ee10, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false) at ../src/backend/storage/lmgr/lock.c:740
#7 0x000000000080e316 in XactLockTableWait (xid=xid@entry=816478533, rel=rel@entry=0x7f7332090468, ctid=ctid@entry=0x2596374, oper=oper@entry=XLTW_InsertIndexUnique) at ../src/backend/storage/lmgr/lmgr.c:702
#8 0x00000000005190bb in heapam_index_build_range_scan (heapRelation=0x7f7332090468, indexRelation=0x7f7332099008, indexInfo=0x2596888, allow_sync=<optimized out>, anyvisible=false, progress=false, start_blockno=0,
numblocks=4294967295, callback=0x53c8c0 <_bt_build_callback>, callback_state=0x7ffdbc01f310, scan=0x2596318) at ../src/backend/access/heap/heapam_handler.c:1496
#9 0x000000000053ca77 in table_index_build_scan (scan=<optimized out>, callback_state=0x7ffdbc01f310, callback=0x53c8c0 <_bt_build_callback>, progress=false, allow_sync=true, index_info=0x2596888, index_rel=<optimized out>,
table_rel=<optimized out>) at ../src/include/access/tableam.h:1781
#10 _bt_parallel_scan_and_sort (btspool=btspool@entry=0x2596d08, btspool2=btspool2@entry=0x2596d38, btshared=btshared@entry=0x2aaaaad423a0, sharedsort=sharedsort@entry=0x2aaaaad42340,
sharedsort2=sharedsort2@entry=0x2aaaaad422e0, sortmem=<optimized out>, progress=progress@entry=false) at ../src/backend/access/nbtree/nbtsort.c:1985
#11 0x000000000053ef6f in _bt_parallel_build_main (seg=<optimized out>, toc=0x2aaaaad42080) at ../src/backend/access/nbtree/nbtsort.c:1888
#12 0x0000000000564ec8 in ParallelWorkerMain (main_arg=<optimized out>) at ../src/backend/access/transam/parallel.c:1520
#13 0x00000000007892f8 in StartBackgroundWorker () at ../src/backend/postmaster/bgworker.c:861
#14 0x0000000000493269 in do_start_bgworker (rw=0x2531fc0) at ../src/backend/postmaster/postmaster.c:5762
#15 maybe_start_bgworkers () at ../src/backend/postmaster/postmaster.c:5986
#16 0x000000000078dc5a in process_pm_pmsignal () at ../src/backend/postmaster/postmaster.c:5149
#17 ServerLoop () at ../src/backend/postmaster/postmaster.c:1770
#18 0x0000000000790635 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x24fe230) at ../src/backend/postmaster/postmaster.c:1463
#19 0x00000000004b80c5 in main (argc=3, argv=0x24fe230) at ../src/backend/main/main.c:198

--
Justin
