PostgreSQL crashes with SIGSEGV

From: Bernd Helmle <mailings(at)oopsware(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: PostgreSQL crashes with SIGSEGV
Date: 2017-12-07 15:47:18
Message-ID: 1512661638.9720.34.camel@oopsware.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

A customer recently reported a crash in a postgres backend. The backend
encountered a SIGSEGV, crashing during SELECTs from a fairly
complicated view using a grouping set directive. I've managed to
reproduce it by tracking it down to a specific SELECT, but
unfortunately couldn't yet manage to strip it down to a small,
repeatable test case which doesn't involve the whole (sensitive)
dataset. I'm reporting my findings so far, maybe it helps to track it
down already.

The crashed backend gives the following backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007
1007 context = ((StandardChunkHeader *)
(gdb) bt
#0 0x000000000099be84 in pfree (pointer=0x17194e8) at mcxt.c:1007
#1 0x00000000004798d8 in heap_free_minimal_tuple (mtup=0x17194e8) at
heaptuple.c:1403
#2 0x0000000000687202 in ExecClearTuple (slot=0x10a3f88) at
execTuples.c:455
#3 0x0000000000686b89 in ExecResetTupleTable (tupleTable=0x10a06a0,
shouldFree=0 '\000') at execTuples.c:169
#4 0x00000000006773ff in ExecEndPlan (planstate=0x10a0398,
estate=0x10a01f0) at execMain.c:1469
#5 0x0000000000675927 in standard_ExecutorEnd (queryDesc=0x10910c0) at
execMain.c:468
#6 0x0000000000675865 in ExecutorEnd (queryDesc=0x10910c0) at
execMain.c:439
#7 0x000000000062c71c in PortalCleanup (portal=0xf79ae0) at
portalcmds.c:280
#8 0x000000000099ce97 in PortalDrop (portal=0xf79ae0, isTopCommit=0
'\000') at portalmem.c:510
#9 0x0000000000813639 in exec_simple_query (
query_string=0xf3dbe0 "SELECT * FROM \"xxxx\".\"xxxx\" WHERE (xxxx
= 'xxxx') LIMIT 1000;") at postgres.c:1095
#10 0x00000000008177ea in PostgresMain (argc=1, argv=0xee9e10,
dbname=0xeb9a78 "xxxx", username=0xee9c78 "bernd")
at postgres.c:4072
#11 0x000000000078cc27 in BackendRun (port=0xedc700) at
postmaster.c:4342
#12 0x000000000078c377 in BackendStartup (port=0xedc700) at
postmaster.c:4016
#13 0x0000000000788983 in ServerLoop () at postmaster.c:1721
#14 0x0000000000787fb8 in PostmasterMain (argc=1, argv=0xeb7960) at
postmaster.c:1329
#15 0x00000000006d1b52 in main (argc=1, argv=0xeb7960) at main.c:228

I've tested this so far against very current REL9_6_STABLE and
REL9_5_STABLE and got them to crash with the same backtrace. The crash
is dependent on the chosen plan, experiments with work_mem show that
the crash seems to happen only if you get external sorts into the
execution plan.

REL10_STABLE seems not affected, as my extracted application query
doesn't crash there.

Running the query against REL9_6_STABLE with valgrind shows the
following results:

==00:00:01:33.067 13158== Invalid write of size 8
==00:00:01:33.067 13158== at 0x93D53B: AllocSetFree (aset.c:998)
==00:00:01:33.067 13158== by 0x93EC03: pfree (mcxt.c:1012)
==00:00:01:33.067 13158== by 0x476E34: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:01:33.067 13158== by 0x6521A3: ExecClearTuple
(execTuples.c:455)
==00:00:01:33.067 13158== by 0x651D4B: ExecResetTupleTable
(execTuples.c:169)

==00:00:03:34.540 14381== Invalid read of size 8
==00:00:03:34.540 14381== at 0x99E29C: pfree (mcxt.c:1007)
==00:00:03:34.540 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:34.540 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:34.540 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.540 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.540 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:34.540 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.540 14381== by 0x62CA58: PortalCleanup
(portalcmds.c:280)
==00:00:03:34.540 14381== by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.540 14381== by 0x8142A4: exec_simple_query
(postgres.c:1095)
==00:00:03:34.540 14381== by 0x818455: PostgresMain
(postgres.c:4072)
==00:00:03:34.540 14381== by 0x78CF63: BackendRun
(postmaster.c:4342)
==00:00:03:34.540 14381== Address 0x188cd220 is 474,080 bytes inside a
block of size 1,048,576 free'd
==00:00:03:34.540 14381== at 0x4C2FD18: free
(vg_replace_malloc.c:530)
==00:00:03:34.540 14381== by 0x99A8DB: AllocSetDelete (aset.c:653)
==00:00:03:34.540 14381== by 0x99C7A4: MemoryContextDelete
(mcxt.c:225)
==00:00:03:34.540 14381== by 0x99C855: MemoryContextDeleteChildren
(mcxt.c:245)
==00:00:03:34.540 14381== by 0x99C772: MemoryContextDelete
(mcxt.c:208)
==00:00:03:34.540 14381== by 0x9A5F10: tuplesort_end
(tuplesort.c:1198)
==00:00:03:34.540 14381== by 0x692568: ExecEndAgg (nodeAgg.c:3449)
==00:00:03:34.540 14381== by 0x67BB46: ExecEndNode
(execProcnode.c:755)
==00:00:03:34.540 14381== by 0x6AD652: ExecEndSubqueryScan
(nodeSubqueryscan.c:181)
==00:00:03:34.540 14381== by 0x67BA69: ExecEndNode
(execProcnode.c:697)
==00:00:03:34.540 14381== by 0x69DF6E: ExecEndLimit
(nodeLimit.c:438)
==00:00:03:34.540 14381== by 0x67BB9D: ExecEndNode
(execProcnode.c:779)
==00:00:03:34.540 14381== Block was alloc'd at
==00:00:03:34.540 14381== at 0x4C2EB6B: malloc
(vg_replace_malloc.c:299)
==00:00:03:34.540 14381== by 0x99AED4: AllocSetAlloc (aset.c:866)
==00:00:03:34.540 14381== by 0x99DC7A: palloc (mcxt.c:904)
==00:00:03:34.540 14381== by 0x4798F6: heap_copy_minimal_tuple
(heaptuple.c:1417)
==00:00:03:34.540 14381== by 0x6878D7: ExecCopySlotMinimalTuple
(execTuples.c:593)
==00:00:03:34.540 14381== by 0x9AC1D2: copytup_heap
(tuplesort.c:3998)
==00:00:03:34.540 14381== by 0x9A616C: tuplesort_puttupleslot
(tuplesort.c:1345)
==00:00:03:34.540 14381== by 0x68D7D8: fetch_input_tuple
(nodeAgg.c:601)
==00:00:03:34.540 14381== by 0x68FEC1: agg_retrieve_direct
(nodeAgg.c:2168)
==00:00:03:34.540 14381== by 0x68F9AB: ExecAgg (nodeAgg.c:1903)
==00:00:03:34.540 14381== by 0x67B70A: ExecProcNode
(execProcnode.c:503)
==00:00:03:34.540 14381== by 0x6AD40F: SubqueryNext
(nodeSubqueryscan.c:53)
==00:00:03:34.540 14381==
{
<insert_a_suppression_name_here>
Memcheck:Addr8
fun:pfree
fun:heap_free_minimal_tuple
fun:ExecClearTuple
fun:ExecResetTupleTable
fun:ExecEndPlan
fun:standard_ExecutorEnd
fun:ExecutorEnd
fun:PortalCleanup
fun:PortalDrop
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
}
==00:00:03:34.548 14381== Invalid read of size 4
==00:00:03:34.548 14381== at 0x99E2AE: pfree (mcxt.c:1010)
==00:00:03:34.548 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:34.548 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:34.548 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.548 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.548 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:34.548 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.548 14381== by 0x62CA58: PortalCleanup
(portalcmds.c:280)
==00:00:03:34.548 14381== by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.548 14381== by 0x8142A4: exec_simple_query
(postgres.c:1095)
==00:00:03:34.548 14381== by 0x818455: PostgresMain
(postgres.c:4072)
==00:00:03:34.548 14381== by 0x78CF63: BackendRun
(postmaster.c:4342)
==00:00:03:34.548 14381== Address 0x7f7f7f7f7f7f7f7f is not stack'd,
malloc'd or (recently) free'd
==00:00:03:34.548 14381==
{
<insert_a_suppression_name_here>
Memcheck:Addr4
fun:pfree
fun:heap_free_minimal_tuple
fun:ExecClearTuple
fun:ExecResetTupleTable
fun:ExecEndPlan
fun:standard_ExecutorEnd
fun:ExecutorEnd
fun:PortalCleanup
fun:PortalDrop
fun:exec_simple_query
fun:PostgresMain
fun:BackendRun
}
==00:00:03:34.548 14381==
==00:00:03:34.548 14381== Process terminating with default action of
signal 11 (SIGSEGV): dumping core
==00:00:03:34.548 14381== General Protection Fault
==00:00:03:34.548 14381== at 0x99E2AE: pfree (mcxt.c:1010)
==00:00:03:34.548 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:34.548 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:34.548 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:34.548 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:34.548 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:34.548 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:34.548 14381== by 0x62CA58: PortalCleanup
(portalcmds.c:280)
==00:00:03:34.548 14381== by 0x99F412: PortalDrop (portalmem.c:510)
==00:00:03:34.548 14381== by 0x8142A4: exec_simple_query
(postgres.c:1095)
==00:00:03:34.548 14381== by 0x818455: PostgresMain
(postgres.c:4072)
==00:00:03:34.548 14381== by 0x78CF63: BackendRun
(postmaster.c:4342)
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== Process terminating with default action of
signal 11 (SIGSEGV)
==00:00:03:35.088 14381== General Protection Fault
==00:00:03:35.088 14381== at 0x632016C: _dl_catch_error (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x631F8E6: __libc_dlclose (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x634B5E4: free_mem (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x634B1E1: __libc_freeres (in
/usr/lib64/libc-2.25.so)
==00:00:03:35.088 14381== by 0x4A296DB: _vgnU_freeres
(vg_preloaded.c:77)
==00:00:03:35.088 14381== by 0x18118EA7: ???
==00:00:03:35.088 14381== by 0x4798D7: heap_free_minimal_tuple
(heaptuple.c:1403)
==00:00:03:35.088 14381== by 0x68753E: ExecClearTuple
(execTuples.c:455)
==00:00:03:35.088 14381== by 0x686EC5: ExecResetTupleTable
(execTuples.c:169)
==00:00:03:35.088 14381== by 0x67773B: ExecEndPlan (execMain.c:1469)
==00:00:03:35.088 14381== by 0x675C63: standard_ExecutorEnd
(execMain.c:468)
==00:00:03:35.088 14381== by 0x675BA1: ExecutorEnd (execMain.c:439)
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== HEAP SUMMARY:
==00:00:03:35.088 14381== in use at exit: 6,833,625 bytes in 531
blocks
==00:00:03:35.088 14381== total heap usage: 814,111 allocs, 1,264
frees, 118,761,978 bytes allocated
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== For a detailed leak analysis, rerun with: --
leak-check=full
==00:00:03:35.088 14381==
==00:00:03:35.088 14381== For counts of detected and suppressed errors,
rerun with: -v
==00:00:03:35.088 14381== ERROR SUMMARY: 2 errors from 2 contexts
(suppressed: 35 from 6)

It seems the backend tries to free a minimal tuple at executor end,
which is already gone by deleting the memory context it was allocated
in before. ExecResetTupleTable() is looping through a list with 70
entries, it encounters (after 6 or seven rounds) the first tuple slot
with tts_shouldFreeMin set, all others before don't have it set:

(gdb) p *slot
$6 = {type = T_TupleTableSlot, tts_isempty = 0 '\000', tts_shouldFree =
0 '\000', tts_shouldFreeMin = 1 '\001', tts_slow = 1 '\001',
tts_tuple = 0x10a6c48, tts_tupleDescriptor = 0x10d1be8, tts_mcxt =
0xf59668, tts_buffer = 0, tts_nvalid = 6, tts_values = 0x10d2640,
tts_isnull = 0x10d26d8 "", tts_mintuple = 0x171c168, tts_minhdr =
{t_len = 162, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0},
ip_posid = 0}, t_tableOid = 0, t_data = 0x171c160}, tts_off = 88}
(gdb) p list_length(tupleTable)

tts_mintuple is invalid, though:

(gdb) p *slot->tts_mintuple
Cannot access memory at address 0x171c168

The following attempt to clean it in ExecClearTuple() then crashes the
backend.

Bernd

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Peter Eisentraut 2017-12-07 17:13:53 Re: BUG #14952: COPY fails to fill in IDENTITY column default value
Previous Message Tom Lane 2017-12-07 15:27:24 Re: BUG #14953: pg_blocking_pids / array_length inconsistency

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2017-12-07 16:02:32 Re: Speeding up pg_upgrade
Previous Message Stephen Frost 2017-12-07 15:37:30 Re: Speeding up pg_upgrade