Re: SegFault on 9.6.14

From: Jerry Sievers <gsievers19(at)comcast(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Jerry Sievers <gsievers19(at)comcast(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: SegFault on 9.6.14
Date: 2019-07-16 23:05:44
Message-ID: 87blxta8c7.fsf@jsievers.enova.com
Lists: pgsql-hackers

Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:

> On Mon, Jul 15, 2019 at 08:20:00PM -0500, Jerry Sievers wrote:
>
>>Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>>
>>> On Mon, Jul 15, 2019 at 07:22:55PM -0500, Jerry Sievers wrote:
>>>
>>>>Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>>>>
>>>>> On Mon, Jul 15, 2019 at 06:48:05PM -0500, Jerry Sievers wrote:
>>>>>
>>>>>>Greetings Hackers.
>>>>>>
>>>>>>We have a reproducible case of $subject that issues a backtrace such as
>>>>>>seen below.
>>>>>>
>>>>>>The query, which I'd prefer to sanitize before sending, is <30 lines of
>>>>>>what is, at a glance, not terribly complex logic.
>>>>>>
>>>>>>It nonetheless dies hard after a few seconds of running and as expected,
>>>>>>results in an automatic all-backend restart.
>>>>>>
>>>>>>Please advise on how to proceed. Thanks!
>>>>>>
>>>>>>bt
>>>>>>#0 initscan (scan=scan@entry=0x55d7a7daa0b0, key=0x0, keep_startblock=keep_startblock@entry=1 '\001')
>>>>>> at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:233
>>>>>>#1 0x000055d7a72fa8d0 in heap_rescan (scan=0x55d7a7daa0b0, key=key@entry=0x0) at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:1529
>>>>>>#2 0x000055d7a7451fef in ExecReScanSeqScan (node=node@entry=0x55d7a7d85100) at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeSeqscan.c:280
>>>>>>#3 0x000055d7a742d36e in ExecReScan (node=0x55d7a7d85100) at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:158
>>>>>>#4 0x000055d7a7445d38 in ExecReScanGather (node=node@entry=0x55d7a7d84d30) at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeGather.c:475
>>>>>>#5 0x000055d7a742d255 in ExecReScan (node=0x55d7a7d84d30) at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:166
>>>>>>#6 0x000055d7a7448673 in ExecReScanHashJoin (node=node@entry=0x55d7a7d84110) at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeHashjoin.c:1019
>>>>>>#7 0x000055d7a742d29e in ExecReScan (node=node@entry=0x55d7a7d84110) at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:226
>>>>>><about 30 lines omitted>
>>>>>>
>>>>>
>>>>> Hmmm, that means it's crashing here:
>>>>>
>>>>>     if (scan->rs_parallel != NULL)
>>>>>         scan->rs_nblocks = scan->rs_parallel->phs_nblocks;  <--- here
>>>>>     else
>>>>>         scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
>>>>>
>>>>> But clearly, scan is valid (otherwise it'd crash on the if condition),
>>>>> and scan->rs_parallel must be non-NULL. Which probably means the pointer
>>>>> is (no longer) valid.
>>>>>
>>>>> Could it be that the rs_parallel DSM disappears on rescan, or something
>>>>> like that?
>>>>
>>>>No clue, but something I just tried was to disable parallelism by setting
>>>>max_parallel_workers_per_gather to 0. The query has not finished after a
>>>>few minutes, but there is no crash.
>>>>
>>>
>>> That might be a hint my rough analysis was somewhat correct. The
>>> question is whether the non-parallel plan does the same thing. Maybe it
>>> picks a plan that does not require rescans, or something like that.
>>>
>>>>Please advise.
>>>>
>>>
>>> It would be useful to see (a) the execution plan of the query, (b) full
>>> backtrace and (c) a bit of context for the place where it crashed.
>>>
>>> Something like (in gdb):
>>>
>>> bt full
>>> list
>>> p *scan
>>
>>The p *scan did nothing unless I ran it first; however, my gdb $foo isn't
>>strong presently.
>
> Hmm, the rs_parallel pointer looks sane (it's not obvious garbage). Can
> you try this?
>
> p *scan->rs_parallel

$ gdb /usr/lib/postgresql/9.6/bin/postgres core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/9.6/bin/postgres...Reading symbols from /usr/lib/debug/.build-id/04/6f55a5ce6ce05064edfc8feee61c6cb039d296.debug...done.
done.
[New LWP 31654]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: foo_eis_segfault: jsievers staging 10.220.22.26(57948) SELECT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 initscan (scan=scan@entry=0x55d7a7daa0b0, key=0x0, keep_startblock=keep_startblock@entry=1 '\001')
at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:233
233 /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c: No such file or directory.
(gdb) p *scan->rs_parallel
Cannot access memory at address 0x7fa673a54108
(gdb)

>
> Another question - are you sure this is not an OOM issue? That might
> sometimes look like SIGSEGV due to overcommit. What's the memory
> consumption / is there anything in dmesg?

Below is all I got after a prior dmesg -c...

dmesg -c
[5441294.442062] postgres[12033]: segfault at 7f3d011d2110 ip 000055666def9a31 sp 00007ffc37be9a70 error 4 in postgres[55666de23000+653000]
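
For reference, a minimal sketch of how the fields of a kernel "segfault at"
line like the one above can be pulled apart. It assumes the x86 Linux message
layout seen here and the usual x86 page-fault error-code bits (bit 0 = page
was present, bit 1 = write, bit 2 = user mode). SegfaultInfo and
parse_segfault_line are hypothetical names introduced for this example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one "segfault at" line. */
typedef struct SegfaultInfo
{
    uint64_t fault_addr;    /* address whose access faulted */
    uint64_t ip;            /* instruction pointer at the fault */
    uint64_t sp;            /* stack pointer at the fault */
    uint64_t map_base;      /* start of the mapping that contains ip */
    uint64_t map_size;      /* size of that mapping */
    uint64_t ip_offset;     /* ip - map_base */
    unsigned error_code;    /* page-fault error code (printed in hex) */
    int      page_present;  /* bit 0: 1 = protection fault, 0 = not mapped */
    int      is_write;      /* bit 1: 1 = write access, 0 = read access */
    int      user_mode;     /* bit 2: 1 = fault raised from user mode */
} SegfaultInfo;

/* Parse one dmesg line; return 0 on success, -1 if it does not match. */
int parse_segfault_line(const char *line, SegfaultInfo *out)
{
    char        object[64];     /* name of the mapped object, e.g. "postgres" */
    const char *p = strstr(line, "segfault at ");

    if (p == NULL)
        return -1;

    if (sscanf(p,
               "segfault at %" SCNx64 " ip %" SCNx64 " sp %" SCNx64
               " error %x in %63[^[][%" SCNx64 "+%" SCNx64 "]",
               &out->fault_addr, &out->ip, &out->sp, &out->error_code,
               object, &out->map_base, &out->map_size) != 7)
        return -1;

    out->ip_offset    = out->ip - out->map_base;
    out->page_present = (out->error_code & 0x1) != 0;
    out->is_write     = (out->error_code & 0x2) != 0;
    out->user_mode    = (out->error_code & 0x4) != 0;
    return 0;
}
```

Applied to the line above, this gives error code 4 (a user-mode read of a
page that is not mapped) and an instruction offset of 0xd6a31 into the
mapping. That offset is only useful for symbol lookup if the mapping starts at
the binary's load base and the binary matches the one that crashed.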

Thanks!

>
> regards

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres(dot)consulting(at)comcast(dot)net
