Re: BUG #15225: [XX000] ERROR: invalid DSA memory alloc request size 1073741824 / Where: parallel worker

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Frits Jalvingh <jal(at)etc(dot)to>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15225: [XX000] ERROR: invalid DSA memory alloc request size 1073741824 / Where: parallel worker
Date: 2018-06-05 21:36:37
Message-ID: CAEepm=1nemj1wGLsnNrVqBLmqV9N6+6WToo=prGxAXafPYiTrA@mail.gmail.com
Lists: pgsql-bugs

On Tue, Jun 5, 2018 at 11:08 PM, Frits Jalvingh <jal(at)etc(dot)to> wrote:
> Hi Thomas, thanks for your help.
>
> The problem seems to be rather inconsistent. Without any changes to
> either the database or the query I now get:
> org.postgresql.util.PSQLException: ERROR: could not resize shared memory
> segment "/PostgreSQL.347733883" to 2147483648 bytes: No space left on device

That's what it looks like when the OS just can't allocate any more
virtual memory (posix_fallocate() on a shared memory region failed).
The solution to that problem is to turn down work_mem. Under our
current model of memory management, every backend may use up to
work_mem for every sort or hash join node. So when you set work_mem =
2GB and max_parallel_workers = 2 (+ 1 leader process = 3 backends) and
joined N tables together (a very hand-wavy approximation of the number
of nodes), you said it was OK to use a total of 2GB * (N - 1) * 3.
That's apparently too much for this rig.
Does it work if you turn work_mem down?
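
To make that concrete (the numbers here are illustrative, not
measured): with work_mem = 2GB, 3 processes and, say, 10 tables
joined, that formula licenses roughly 2GB * 9 * 3 = 54GB. Something
like this should tell us whether memory pressure is the whole story:

  -- Illustrative only: 64MB is an arbitrary example value,
  -- not a recommendation.
  SET work_mem = '64MB';
  EXPLAIN (ANALYZE, FORMAT XML) SELECT ... ;  -- your original query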

> Just executing the statement WITHOUT the explain does work properly, so it
> is the explain part that seems to trigger the issue.

That is weird. I don't have a theory to explain that yet.

> I set enable_parallel_hash = off in the code and now the explain part works
> properly. The plan XML looks like this:

> <Node-Type>Sort</Node-Type>
> <Plan-Rows>1462</Plan-Rows>
> <Actual-Rows>184162324</Actual-Rows>

> <Node-Type>Hash Join</Node-Type>
> <Plan-Rows>1462</Plan-Rows>
> <Actual-Rows>184162324</Actual-Rows>

> <Node-Type>Hash</Node-Type>
> <Plan-Rows>47373</Plan-Rows>
> <Actual-Rows>731320569</Actual-Rows>

Thanks. Hmm. I didn't try to understand this plan in detail, but it
is clearly underestimating cardinality dramatically (1,462 planned
rows vs. over 184 million actual), and the numbers involved are large
enough to produce an nbuckets of the right order of magnitude.

If the first error you reported results from an attempt to allocate a
very large bucket array (rather than, say, a monster tuple >= 1GB), as
I speculated, then there would need to be a flaw in the logic that
clamps nbuckets to fit in MaxAllocSize. I will do some digging today.
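
For what it's worth, the reported request size of 1073741824 bytes is
exactly 2^27 eight-byte bucket pointers, one byte over MaxAllocSize
(0x3fffffff). A minimal sketch of the kind of clamp I mean, with
illustrative names rather than the actual nodeHash.c code:

  #include <stddef.h>
  #include <stdio.h>

  #define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)  /* MaxAllocSize */

  /* Clamp a bucket count so the bucket array fits in one allocation.
   * The real code also rounds down to a power of two; omitted here. */
  static size_t
  clamp_nbuckets(size_t nbuckets, size_t bucket_width)
  {
      size_t max_buckets = MAX_ALLOC_SIZE / bucket_width;

      return nbuckets > max_buckets ? max_buckets : nbuckets;
  }

  int
  main(void)
  {
      size_t nbuckets = (size_t) 1 << 27;  /* what the error implies */

      printf("requested: %zu bytes\n", nbuckets * 8);  /* 1073741824 */
      nbuckets = clamp_nbuckets(nbuckets, 8);
      printf("clamped:   %zu bytes\n", nbuckets * 8);  /* 1073741816 */
      return 0;
  }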

--
Thomas Munro
http://www.enterprisedb.com
