Re: dsa_allocate() faliure

From: Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Rick Otten <rottenwindfish(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)lists(dot)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: dsa_allocate() faliure
Date: 2018-08-25 14:46:32
Message-ID: CADrk5qP4rBLVa4t6VOyzNKGB-4JCexaFMqa2BybcooAjvmM3sQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

> Can you still see the problem with Citus 7.4?
Hi, Thomas. I actually went back to the cluster with Citus 7.4 and
PG 10.4, and modified the parallel parameter. So far, I haven't seen
any server crash.
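
For concreteness, the change was something like the setting below
(the value is an illustration from memory, not necessarily the exact
one on the cluster):

  # postgresql.conf on the coordinator and the workers, then reload
  max_parallel_workers_per_gather = 0   # illustrative; 0 keeps parallel plans out of the picture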

The main difference between the runs that crashed and the ones that
did not is the set of Linux TCP timeout parameters (tuned to release
ports faster). Unfortunately, I cannot "undo" those Linux parameters
and rerun the stress tests, as this is a multi-million-dollar cluster
and people are doing more useful things on it. I will keep an eye on
any parallel execution issue.
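
For the record, the TCP tuning was of this kind (values are
illustrative, not the exact ones we used):

  # /etc/sysctl.conf on coordinator and workers, applied with sysctl -p (illustrative values)
  net.ipv4.tcp_fin_timeout = 10                 # shorten FIN-WAIT-2 lifetime
  net.ipv4.tcp_tw_reuse = 1                     # reuse TIME-WAIT sockets for outbound connections
  net.ipv4.ip_local_port_range = 10240 65535    # widen the ephemeral port range

("ss -s" on the coordinator gives a quick count of TIME-WAIT sockets
if anyone wants to check for port exhaustion.)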

On Wed, Aug 15, 2018 at 3:43 PM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com> wrote:
> > Just as a follow up. I tried the parallel execution again (in a stress
> > test environment). Now the crash seems gone. I will keep an eye on
> > this for the next few weeks.
>
> Thanks for the report. That's great news, but it'd be good to
> understand why it was happening.
>
> > My theory is that the Citus cluster created and shut down a lot of TCP
> > connections between coordinator and workers. If running on untuned
> > Linux machines, the TCP ports might run out.
>
> I'm not sure how that's relevant, unless perhaps it causes executor
> nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
> fix? I wonder if there could be something different about the control
> flow with custom scans, or something about the way Citus worker nodes
> invoke plan fragments, or some error path that I failed to consider...
> It's a clue that all of your worker nodes reliably crashed at the same
> time on the same/similar queries (presumably distributed query
> fragments for different shards), making it seem more like a
> common-or-garden bug rather than some kind of timing-based heisenbug.
> If you ever manage to reproduce it, an explain plan and a back trace
> would be very useful.
>
> > Of course, I am using "newer" PG10 bits and Citus7.5 this time.
>
> Hmm. There weren't any relevant commits to REL_10_STABLE that I can
> think of. And (with the proviso that I know next to nothing about
> Citus) I just cloned https://github.com/citusdata/citus.git and
> skimmed through "git diff origin/release-7.4..origin/release-7.5", and
> nothing is jumping out at me. Can you still see the problem with
> Citus 7.4?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
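
PS: if it does show up again, the plan for the back trace would be to
enable core dumps on the workers beforehand and pull a stack from the
core with gdb, roughly like this (binary and core paths are
illustrative for our setup):

  # on each worker, before the run
  ulimit -c unlimited
  # after a crash, against the worker's postgres binary and its core file (paths illustrative)
  gdb -batch -ex "bt full" /usr/lib/postgresql/10/bin/postgres /path/to/core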
