RE: dsa_allocate() faliure

From: Arne Roland <A(dot)Roland(at)index(dot)de>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com>
Cc: Rick Otten <rottenwindfish(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: RE: dsa_allocate() faliure
Date: 2019-01-24 14:44:41
Message-ID: 6f3fe9fa5a984dc19e40e79fbef45edc@index.de
Lists: pgsql-hackers pgsql-performance

Hello,

I'm not sure whether this is connected at all, but I'm facing the same error with a generated query on PostgreSQL 10.6.
It works with parallel query disabled and fails with "dsa_allocate could not find 7 free pages" otherwise.
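
For anyone trying to reproduce this, a minimal sketch of disabling parallel query for a single session, assuming the standard max_parallel_workers_per_gather setting (the exact method used is not stated in this message):

```sql
-- Assumption: session-level override; the message does not say how
-- parallel query was actually disabled.
SET max_parallel_workers_per_gather = 0;  -- planner no longer produces parallel plans

-- ... run the failing query here ...

RESET max_parallel_workers_per_gather;    -- restore the configured default
```

Running EXPLAIN on the query before and after the SET shows whether the plan still contains a Gather node.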

I've attached the query and an strace log. The table is partitioned on (o, date). The error doesn't depend on the precise lists I'm using, but it obviously does depend on the optimizer choosing a parallel plan.

Regards
Arne Roland

-----Original Message-----
From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Sent: Friday, October 5, 2018 4:17 AM
To: Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com>
Cc: Rick Otten <rottenwindfish(at)gmail(dot)com>; Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>; pgsql-performance(at)lists(dot)postgresql(dot)org; Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: dsa_allocate() faliure

On Wed, Aug 29, 2018 at 5:48 PM Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com> wrote:
> I attached a query (and its query plan) that caused the crash: "dsa_allocate could not find 13 free pages" on one of the worker nodes. I anonymised the query text a bit. Interestingly, this time only one of the nodes (the same one) is crashing. Since this is a production environment, I cannot get the stack trace. Once I turned off parallel execution for this node, the whole query finished just fine. So the parallel query plan is from one of the nodes that did not crash; hopefully the same plan would have been executed on the crashed node. In theory, every worker node has the same bits and very similar data.

I wonder if this was a different symptom of the problem fixed here:

https://www.postgresql.org/message-id/flat/194c0706-c65b-7d81-ab32-2c248c3e2344%402ndquadrant.com

Can you still reproduce it on current master, REL_11_STABLE or REL_10_STABLE?

--
Thomas Munro
http://www.enterprisedb.com

Attachment   Content-Type              Size
strace.log   application/octet-stream  435.5 KB
query.sql    application/octet-stream  141.1 KB