Re: ERROR: too many dynamic shared memory segments

From: Jakub Glapa <jakub(dot)glapa(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: ERROR: too many dynamic shared memory segments
Date: 2017-11-27 12:13:32
Message-ID: CAJk1zg2XqYa1Dx14z9X3qkUgWH6kpoNbvSGT_=NtzDPnMhfFHg@mail.gmail.com
Lists: pgsql-general pgsql-hackers

Hi Thomas,
log excerpt:

...
2017-11-27 12:21:14 CET:192.168.10.83(33424):user(at)db:[27291]: ERROR: too
many dynamic shared memory segments
2017-11-27 12:21:14 CET:192.168.10.83(33424):user(at)db:[27291]: STATEMENT:
SELECT << REMOVED>>
2017-11-27 12:21:14 CET:192.168.10.83(35182):user(at)db:[28281]: ERROR: too
many dynamic shared memory segments
2017-11-27 12:21:14 CET:192.168.10.83(35182):user(at)db:[28281]: STATEMENT:
SELECT <<REMOVED>>
2017-11-27 12:21:14 CET::@:[28618]: ERROR: could not map dynamic shared
memory segment
2017-11-27 12:21:14 CET::@:[28619]: ERROR: could not map dynamic shared
memory segment
2017-11-27 12:21:14 CET::@:[25645]: LOG: worker process: parallel worker
for PID 27291 (PID 28618) exited with exit code 1
2017-11-27 12:21:14 CET::@:[25645]: LOG: worker process: parallel worker
for PID 27291 (PID 28619) exited with exit code 1
...

This time the DB didn't crash, but the queries failed to execute.

The queries are somewhat special.
We are still using the old-style (list) partitioning, but we abuse it a
bit when querying: instead of going through the parent table, we stitch
the required partitions together with UNION ALL. This requires less
locking and was more performant in our benchmarks (the new native
partitioning might improve on this, but we haven't researched it yet).

The queries are of the form:

SELECT col1, col2, col3 FROM
 (SELECT *
    FROM par1
   WHERE rangestart >= '2017-01-01 00:00:00.0'
     AND rangeend <= '2017-11-26 23:59:59.999'
  UNION ALL
  SELECT *
    FROM par2
   WHERE rangestart >= '2017-01-01 00:00:00.0'
     AND rangeend <= '2017-11-26 23:59:59.999'
  UNION ALL
  SELECT *
    FROM par3
   WHERE rangestart >= '2017-01-01 00:00:00.0'
     AND rangeend <= '2017-11-26 23:59:59.999'
  UNION ALL
  ...
 ) unionalias
WHERE ....
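As the quoted message below notes, the error doesn't occur with
max_parallel_workers_per_gather = 0, so a session-level mitigation (a
sketch only; it trades away parallelism rather than fixing the
underlying segment exhaustion) would be:

```sql
-- Disable parallel plans for this session only (workaround sketch):
SET max_parallel_workers_per_gather = 0;
-- ... run the UNION ALL query here ...
RESET max_parallel_workers_per_gather;
```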

And indeed the query plan shows Bitmap Heap Scans, including a Parallel Bitmap Heap Scan:
....
-> Subquery Scan on "*SELECT* 2"  (cost=3068.58..19793.94 rows=1 width=1284)
   -> Gather  (cost=3068.58..19793.93 rows=1 width=5014)
      Workers Planned: 2
      -> Parallel Bitmap Heap Scan on par_6  (cost=2068.58..18793.83 rows=1 width=5014)
         Recheck Cond: <<CONDITION>>
         Filter: <<CONDITION>>
         -> BitmapAnd  (cost=2068.58..2068.58 rows=30955 width=0)
            -> BitmapOr  (cost=999.30..999.30 rows=42989 width=0)
               -> Bitmap Index Scan on par_6_datasource  (cost=0.00..990.21 rows=42922 width=0)
                  Index Cond: ((datasource)::text = 'one'::text)
               -> Bitmap Index Scan on par_6_datasource  (cost=0.00..4.30 rows=1 width=0)
                  Index Cond: ((datasource)::text = 'two'::text)
               -> Bitmap Index Scan on par_6_datasource  (cost=0.00..4.79 rows=67 width=0)
                  Index Cond: ((datasource)::text = 'three'::text)
            -> Bitmap Index Scan on par_6_rangestart  (cost=0.00..1069.02 rows=47564 width=0)
               Index Cond: (rangestart >= '2017-01-01 00:00:00+01'::timestamp with time zone)
-> Subquery Scan on "*SELECT* 3"  (cost=761.33..7944.99 rows=1 width=1086)
   -> Bitmap Heap Scan on par_7  (cost=761.33..7944.98 rows=1 width=4816)
      Recheck Cond: <<CONDITION>>
      Filter: <<CONDITION>>
      -> BitmapAnd  (cost=761.33..761.33 rows=7045 width=0)
         -> Bitmap Index Scan on par_7_rangestart  (cost=0.00..380.35 rows=14942 width=0)
            Index Cond: (rangestart >= '2017-01-01 00:00:00+01'::timestamp with time zone)
         -> BitmapOr  (cost=380.72..380.72 rows=12248 width=0)
            -> Bitmap Index Scan on par_7_datasource  (cost=0.00..372.00 rows=12228 width=0)
               Index Cond: ((datasource)::text = 'one'::text)
            -> Bitmap Index Scan on par_7_datasource  (cost=0.00..4.36 rows=10 width=0)
               Index Cond: ((datasource)::text = 'two'::text)
            -> Bitmap Index Scan on par_7_datasource  (cost=0.00..4.36 rows=10 width=0)
               Index Cond: ((datasource)::text = 'three'::text)

....

In this particular query there were over _100_ partitions connected with
the UNION ALL operator.
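As a back-of-the-envelope on why ~100 parallel-planned partitions can
hit the limit: taking the ceiling Thomas quotes below (64 fixed DSM
slots plus 2 per backend), with max_connections = 100 the budget is
only about 264 segments shared by the whole cluster. If each Gather
node holds at least one segment while it runs, and the per-query DSA
areas can grow into additional segments, a few such queries running
concurrently could plausibly exhaust it:

```sql
-- Approximate cluster-wide DSM slot ceiling for max_connections = 100
-- (sketch; MaxBackends is actually a bit larger than max_connections):
SELECT 64 + 2 * 100 AS approx_dsm_slot_limit;  -- 264
```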

--
regards,
pozdrawiam,
Jakub Glapa

On Mon, Nov 27, 2017 at 11:47 AM, Thomas Munro <
thomas(dot)munro(at)enterprisedb(dot)com> wrote:

> On Mon, Nov 27, 2017 at 10:54 PM, Jakub Glapa <jakub(dot)glapa(at)gmail(dot)com>
> wrote:
> > The DB enters recovery mode after that.
>
> That's not good. So it actually crashes? Can you please show the
> full error messages?
>
> > 2017-11-23 07:20:39 CET::@:[24823]: ERROR: could not attach to dynamic
> > shared area
>
> From src/backend/utils/mmgr/dsa.c. The same message is used for
> dsa_attach() and for attach_internal(), but in this case it must be
> the latter because we use in-place DSA areas for parallel query. This
> means that when the worker tried to attach it found that
> control->refcnt == 0, meaning 'too late, every other backend including
> the leader has already detached'.
>
> > 2017-11-23 07:20:39 CET::@:[24822]: ERROR: could not map dynamic shared
> > memory segment
>
> From src/backend/access/transam/parallel.c when dsm_attach returns
> null. I think this also means 'too late'.
>
> So those are both workers that have started up and found that the
> leader has abandoned the parallel query already, but they discovered
> it at different stages. PID 24823 didn't manage to attach to the DSM
> segment, while PID 24822 did but found that the other(s) had already
> detached from the per-query DSA area inside it.
>
> > 2017-11-23 07:20:40 CET:192.168.xx,xx(33974):u(at)db:[24209]: ERROR:
> too
> > many
> > dynamic shared memory segments
> >
> > The errors happen when the parallel execution is enabled and multiple
> > queries are executed simultaneously.
> > If I set the max_parallel_workers_per_gather = 0 the error doesn't
> occur.
> > The queries are rather big, each executes over 10sec.
> > I see the error when 4 or more queries are started at the same time.
> >
> > my postgresql.conf:
> >
> > max_connections = 100
>
> So the two questions we need to figure out are: (1) how do we manage
> to use up all 64 + 2 * 100 DSM slots (or somehow corrupt things so it
> looks that way) by running only 4 queries, and (2) why would we be
> crashing rather than failing and reporting an error? I'm not able to
> reproduce the problem from your description running lots of parallel
> queries running at the same time. Can you provide reproduction steps?
> Does your query plan happen to include a Parallel Bitmap Heap Scan?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
