Re: ERROR: too many dynamic shared memory segments

From: Jakub Glapa <jakub(dot)glapa(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Forums postgresql <pgsql-general(at)postgresql(dot)org>
Subject: Re: ERROR: too many dynamic shared memory segments
Date: 2017-12-04 12:18:38
Message-ID: CAJk1zg01hqzWdtiXzUEmGkZM0Cgh8dUnSYf-SJY8juKarj-UWA@mail.gmail.com
Lists: pgsql-general pgsql-hackers

I see that the segfault is under active discussion, but I just wanted to ask:
is increasing max_connections the way to go to mitigate the DSM slot shortage?
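
For context on that question: the DSM slot budget isn't a standalone GUC; it's
derived from MaxBackends when the control segment is sized at startup. A sketch
of the PG 10-era calculation in src/backend/storage/ipc/dsm.c follows (constants
quoted from memory, so treat them as approximate):

    /* In dsm_postmaster_startup(): the slot count scales with MaxBackends,
     * which folds in max_connections, autovacuum workers, and background
     * worker slots. */
    maxitems = PG_DYNSHMEM_FIXED_SLOTS                    /* 64 */
        + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;    /* 2 per backend */

Assuming that formula, raising max_connections (or max_worker_processes) does
increase the slot budget.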

--
regards,
Jakub Glapa

On Mon, Nov 27, 2017 at 11:48 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> On Tue, Nov 28, 2017 at 10:05 AM, Jakub Glapa <jakub(dot)glapa(at)gmail(dot)com> wrote:
> > As for the crash. I dug up the initial log and it looks like a segmentation fault...
> >
> > 2017-11-23 07:26:53 CET:192.168.10.83(35238):user(at)db:[30003]: ERROR: too many dynamic shared memory segments
>
> Hmm. Well this error can only occur in dsm_create() called without
> DSM_CREATE_NULL_IF_MAXSEGMENTS. parallel.c calls it with that flag
> and dsa.c doesn't (perhaps it should, not sure, but that'd just change
> the error message), so that means the error arose from dsa.c trying to
> get more segments. That would be when Parallel Bitmap Heap Scan tried
> to allocate memory.
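
The check being described is in dsm_create() in src/backend/storage/ipc/dsm.c;
roughly sketched below (paraphrased from the PG 10 sources, not verbatim, and
slot_acquired is a stand-in name for the internal bookkeeping):

    if (!slot_acquired)
    {
        /* parallel.c passes DSM_CREATE_NULL_IF_MAXSEGMENTS and falls back to
         * running without workers when it gets NULL; dsa.c does not pass the
         * flag, so slot exhaustion surfaces as the ERROR seen in the log. */
        if ((flags & DSM_CREATE_NULL_IF_MAXSEGMENTS) != 0)
            return NULL;
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
                 errmsg("too many dynamic shared memory segments")));
    }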
>
> I hacked my copy of PostgreSQL so that it allows only 5 DSM slots and
> managed to reproduce a segv crash by trying to run concurrent Parallel
> Bitmap Heap Scans. The stack looks like this:
>
> * frame #0: 0x00000001083ace29 postgres`alloc_object(area=0x0000000000000000, size_class=10) + 25 at dsa.c:1433
>   frame #1: 0x00000001083acd14 postgres`dsa_allocate_extended(area=0x0000000000000000, size=72, flags=4) + 1076 at dsa.c:785
>   frame #2: 0x0000000108059c33 postgres`tbm_prepare_shared_iterate(tbm=0x00007f9743027660) + 67 at tidbitmap.c:780
>   frame #3: 0x0000000108000d57 postgres`BitmapHeapNext(node=0x00007f9743019c88) + 503 at nodeBitmapHeapscan.c:156
>   frame #4: 0x0000000107fefc5b postgres`ExecScanFetch(node=0x00007f9743019c88, accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77), recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) + 459 at execScan.c:95
>   frame #5: 0x0000000107fef983 postgres`ExecScan(node=0x00007f9743019c88, accessMtd=(postgres`BitmapHeapNext at nodeBitmapHeapscan.c:77), recheckMtd=(postgres`BitmapHeapRecheck at nodeBitmapHeapscan.c:710)) + 147 at execScan.c:162
>   frame #6: 0x00000001080008d1 postgres`ExecBitmapHeapScan(pstate=0x00007f9743019c88) + 49 at nodeBitmapHeapscan.c:735
>
> (lldb) f 3
> frame #3: 0x0000000108000d57 postgres`BitmapHeapNext(node=0x00007f9743019c88) + 503 at nodeBitmapHeapscan.c:156
>    153          * dsa_pointer of the iterator state which will be used by
>    154          * multiple processes to iterate jointly.
>    155          */
> -> 156         pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
>    157 #ifdef USE_PREFETCH
>    158         if (node->prefetch_maximum > 0)
>    159
> (lldb) print tbm->dsa
> (dsa_area *) $3 = 0x0000000000000000
> (lldb) print node->ss.ps.state->es_query_dsa
> (dsa_area *) $5 = 0x0000000000000000
> (lldb) f 17
> frame #17: 0x000000010800363b postgres`ExecGather(pstate=0x00007f9743019320) + 635 at nodeGather.c:220
>    217          * Get next tuple, either from one of our workers, or by running the plan
>    218          * ourselves.
>    219          */
> -> 220         slot = gather_getnext(node);
>    221         if (TupIsNull(slot))
>    222                 return NULL;
>    223
> (lldb) print *node->pei
> (ParallelExecutorInfo) $8 = {
>   planstate = 0x00007f9743019640
>   pcxt = 0x00007f97450001b8
>   buffer_usage = 0x0000000108b7e218
>   instrumentation = 0x0000000108b7da38
>   area = 0x0000000000000000
>   param_exec = 0
>   finished = '\0'
>   tqueue = 0x0000000000000000
>   reader = 0x0000000000000000
> }
> (lldb) print *node->pei->pcxt
> warning: could not load any Objective-C class information. This will
> significantly reduce the quality of type information available.
> (ParallelContext) $9 = {
>   node = {
>     prev = 0x000000010855fb60
>     next = 0x000000010855fb60
>   }
>   subid = 1
>   nworkers = 0
>   nworkers_launched = 0
>   library_name = 0x00007f9745000248 "postgres"
>   function_name = 0x00007f9745000268 "ParallelQueryMain"
>   error_context_stack = 0x0000000000000000
>   estimator = (space_for_chunks = 180352, number_of_keys = 19)
>   seg = 0x0000000000000000
>   private_memory = 0x0000000108b53038
>   toc = 0x0000000108b53038
>   worker = 0x0000000000000000
> }
>
> I think there are two failure modes: one of your sessions showed the
> "too many ..." error (that's good: it ran out of slots, said so, and
> our error machinery worked as it should), and another crashed with a
> segfault because it tried to use a NULL "area" pointer (bad). I think
> this is a degenerate case where we completely failed to launch the
> parallel query, but we ran the parallel query plan anyway and this
> code thinks that the DSA is available. Oops.
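
Reading the lldb output above, es_query_dsa is NULL exactly because the
parallel context never launched (seg == NULL, nworkers_launched == 0). As a
sketch of the defensive shape such a fix could take (not the committed patch,
just an illustration), BitmapHeapNext() could guard the shared-iterator path:

    /* Sketch only: use the shared iterator only when a DSA area exists, i.e.
     * the parallel context actually launched and set es_query_dsa; otherwise
     * fall back to a backend-local iterator. */
    if (pstate != NULL && node->ss.ps.state->es_query_dsa != NULL)
        pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
    else
        node->tbmiterator = tbm_begin_iterate(tbm);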
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
