Re: dsa_allocate() faliure

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Jakub Glapa <jakub(dot)glapa(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dsa_allocate() faliure
Date: 2018-11-27 03:00:34
Message-ID: CAEepm=2McVb9t3cS0yfoKoxBXFxhbJrn5rApq6CSMDgQ0OUGww@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Nov 27, 2018 at 7:45 AM Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> On 2018-Nov-26, Jakub Glapa wrote:
> > Justin thanks for the information!
> > I'm running Ubuntu 16.04.
> > I'll try to prepare for the next crash.
> > Couldn't find anything this time.
>
> As I recall, the apport stuff in Ubuntu is terrible ... I've seen it
> take 40 minutes to write the crash dump to disk, during which the
> database was "down". I don't know why it is so slow (it's a rather
> silly python script that apparently processes the core dump one byte at
> a time, and you can imagine that with a few gigabytes of shared memory
> that takes a while). Anyway my recommendation was to *remove* that
> stuff from the server and make sure the core file is saved by normal
> means.

Thanks for CC-ing me. I didn't see this thread earlier because I'm
not subscribed to -performance. Let's move it over to -hackers since
it looks like it's going to be a debugging exercise. So, reading
through the thread[1], I think there might be two independent problems
here:

1. Jakub has a many-partition Parallel Bitmap Heap Scan query that
segfaults when run with max_parallel_workers = 0. That sounds
suspiciously like an instance of a class of bug we've run into before.
We planned a parallel query, but were unable to launch one due to lack
of DSM slots or process slots, so we run the parallel plan in a kind
of degraded non-parallel mode that needs to cope with various pointers
into shared memory being NULL. A back trace from a core file should
hopefully make it very obvious what's going on.

2. The same query when run in real parallel query mode occasionally
reaches an error "dsa_allocate could not find 7 free pages", which
should not happen. This is on 10.6, so it has the commit "Fix
segment_bins corruption in dsa.c.".
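
To illustrate the class of bug described in point 1, here is a minimal
sketch of the pattern, not the actual executor code. All names
(SharedScanState, ScanState, scan_next_block) are hypothetical and
invented for this example. The idea is that a node planned for parallel
execution must check whether its pointer into shared memory is NULL
before dereferencing it, and fall back to backend-local state when no
DSM segment could be set up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for state that would live in a DSM segment. */
typedef struct SharedScanState
{
    int next_block;     /* next block to hand out to any participant */
    int nblocks;        /* total number of blocks to scan */
} SharedScanState;

/* Hypothetical per-backend scan state. */
typedef struct ScanState
{
    SharedScanState *shared;    /* NULL if no DSM segment was set up */
    int local_next;             /* used only in the non-parallel fallback */
    int nblocks;
} ScanState;

/*
 * Return the next block number to scan, or -1 when the scan is done.
 *
 * The NULL check is the important part: omitting it and dereferencing
 * scan->shared unconditionally is the kind of mistake that produces a
 * segfault only when the parallel plan runs without workers.
 */
static int
scan_next_block(ScanState *scan)
{
    assert(scan != NULL);

    if (scan->shared != NULL)
    {
        /* Parallel mode (a real implementation would also need locking). */
        if (scan->shared->next_block >= scan->shared->nblocks)
            return -1;
        return scan->shared->next_block++;
    }

    /* Degraded non-parallel mode: use backend-local state only. */
    if (scan->local_next >= scan->nblocks)
        return -1;
    return scan->local_next++;
}
```

A code path that is only ever exercised with a valid shared pointer will
pass normal parallel-query testing and fail only in the degraded mode,
which would match the max_parallel_workers = 0 symptom.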

Hmm. I will see if I can come up with a many-partition torture test
reproducer for this.

[1] https://www.postgresql.org/message-id/flat/CAJk1zg10iCNsxFvQ4pgKe1B0rdjNG9iELA7AzLXjXnQm5T%3DKzQ%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
