Re: dsa_allocate() faliure

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com>
Cc: Rick Otten <rottenwindfish(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)lists(dot)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: dsa_allocate() faliure
Date: 2018-08-15 22:42:25
Message-ID: CAEepm=1D_RP1OV=5_mF6d4hiNGGy4fzyaXp3=e4wXLBup20a9g@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com> wrote:
> Just as a follow-up. I tried the parallel execution again (in a stress
> test environment). Now the crash seems gone. I will keep an eye on
> this for the next few weeks.

Thanks for the report. That's great news, but it'd be good to
understand why it was happening.

> My theory is that the Citus cluster created and shut down a lot of TCP
> connections between coordinator and workers. If running on untuned
> Linux machines, the TCP ports might run out.

I'm not sure how that's relevant, unless perhaps it causes executor
nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
fix? I wonder if there could be something different about the control
flow with custom scans, or something about the way Citus worker nodes
invoke plan fragments, or some error path that I failed to consider...
It's a clue that all of your worker nodes reliably crashed at the same
time on the same/similar queries (presumably distributed query
fragments for different shards), making it seem more like a
common-or-garden bug than some kind of timing-based heisenbug.
If you ever manage to reproduce it, an EXPLAIN plan and a backtrace
would be very useful.

> Of course, I am using "newer" PG10 bits and Citus 7.5 this time.

Hmm. There weren't any relevant commits to REL_10_STABLE that I can
think of. And (with the proviso that I know next to nothing about
Citus) I just cloned https://github.com/citusdata/citus.git and
skimmed through "git diff origin/release-7.4..origin/release-7.5", and
nothing is jumping out at me. Can you still see the problem with
Citus 7.4?

--
Thomas Munro
http://www.enterprisedb.com
