Re: dsa_allocate() faliure

From:	Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Rick Otten <rottenwindfish(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)lists(dot)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject:	Re: dsa_allocate() faliure
Date:	2018-05-23 14:06:41
Message-ID:	CADrk5qMoyhPcRqUBO+SCRsnc_mJG_z0fK5HA2zb=Lnouxar4aw@mail.gmail.com
Lists:	pgsql-hackers pgsql-performance

>> At which commit ID?
83fcc615020647268bb129cbf86f7661feee6412 (5/6)

>>do you mean that these were separate PostgreSQL clusters, and they were all running the same query and they all crashed like this?
There are a few worker nodes. The table is hash-partitioned on
"aTable.did" by Citus, and further partitioned by PG10 by time range on
the field "ts". As far as I can tell, Citus just rewrites the query and
sends the same type of query to every node.

>>so this happened at the same time or at different times?
At the same time. The queries are simple count and sum queries. Here
is the relevant log excerpt from one of the worker nodes:
2018-05-23 01:24:01.492 UTC [130536] ERROR: dsa_allocate could not
find 7 free pages
2018-05-23 01:24:01.492 UTC [130536] CONTEXT: parallel worker
STATEMENT: COPY (SELECT count(1) AS count, sum(worker_column_1) AS
sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT
aTable.did, avg((aTable.sum OPERATOR(pg_catalog./)
(aTable.count)::double precision)) AS avg FROM public.aTable_102117
aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25
00:00:00+00'::timestamp with time zone) AND (aTable.ts
OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time
zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY
aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)
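
For anyone collecting the same evidence across several worker nodes, here is a minimal sketch that scans a PostgreSQL log for dsa_allocate errors and notes whether each one was reported with a "parallel worker" CONTEXT line. It assumes the log prefix matches the excerpt above (timestamp, time zone, PID in brackets). The function name and the regular expression are illustrative, not part of PostgreSQL or Citus.

```python
import re

# Assumed line shape, taken from the excerpt above:
#   2018-05-23 01:24:01.492 UTC [130536] ERROR: dsa_allocate could not ...
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (?P<tz>\S+) "
    r"\[(?P<pid>\d+)\] (?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def dsa_failures(lines):
    """Summarise dsa_allocate errors per PID (hypothetical helper).

    Returns {pid: {"count": n, "parallel_worker": bool}}, where
    parallel_worker is True if a "CONTEXT: parallel worker" line
    with the same PID follows one of the errors.
    """
    result = {}
    for line in lines:
        m = LOG_LINE.match(line.rstrip("\n"))
        if not m:
            continue  # continuation of a multi-line message, or unrelated text
        pid = int(m["pid"])
        if m["level"] == "ERROR" and "dsa_allocate" in m["msg"]:
            entry = result.setdefault(pid, {"count": 0, "parallel_worker": False})
            entry["count"] += 1
        elif m["level"] == "CONTEXT" and "parallel worker" in m["msg"] and pid in result:
            result[pid]["parallel_worker"] = True
    return result
```

Passing an open log file (or any iterable of lines) to `dsa_failures` gives a per-PID summary that can be compared across nodes.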

>> a parallel worker process
I think this is more of a PG10 parallel bg worker issue. I don't think
Citus just lets each worker PG server do its own planning.

I will run more experiments and see whether any specific query causes
the parallel query execution to fail. As far as I can tell, the level
of concurrency triggers this issue: when tens of queries like the one
shown above run on the worker nodes, PG10 may or may not spawn more bg
workers, depending on the stats.

Thanks for your time!

On Tue, May 22, 2018 at 9:44 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Wed, May 23, 2018 at 4:10 PM, Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com> wrote:
>>>>dsa_allocate could not find 7 free pages
>> I just saw this error message again on all of my worker nodes (I am using
>> Citus 7.4 rel). The PG core is my own build of release_10_stable
>> (10.4) out of GitHub on Ubuntu.
>
> At which commit ID?
>
> All of your worker nodes... so this happened at the same time or at
> different times? I don't know much about Citus -- do you mean that
> these were separate PostgreSQL clusters, and they were all running the
> same query and they all crashed like this?
>
>> What's the best way to debug this? I am running pre-production tests
>> for the next few days, so I could gather info. if necessary (I cannot
>> pinpoint a query to repro this yet, as we have 10K queries running
>> concurrently).
>
> Any chance of an EXPLAIN plan for the query that crashed like this?
> Do you know if it's using multiple Gather[Merge] nodes and parallel
> bitmap heap scans? Was it a regular backend process or a parallel
> worker process (or a Citus worker process, if that is a thing?) that
> raised the error?
>
> --
> Thomas Munro
> http://www.enterprisedb.com

In response to

Re: dsa_allocate() faliure at 2018-05-23 04:44:25 from Thomas Munro

Responses

Re: dsa_allocate() faliure at 2018-08-15 20:32:45 from Sand Stone
