Re: pg11.1: dsa_area could not attach to segment

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Sergei Kornilov <sk(at)zsrv(dot)org>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg11.1: dsa_area could not attach to segment
Date: 2019-02-11 23:57:51
Message-ID: CAEepm=2=BFXs_+8X-eEyNBpG0OddaWe190KQaOX2TE5aS8UL-A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 12, 2019 at 1:51 AM Sergei Kornilov <sk(at)zsrv(dot)org> wrote:
> > Here's confirmed steps to reproduce
>
> Wow, i confirm this testcase is reproducible for me. On my 4-core desktop i see "dsa_area could not attach to segment" error after minute or two.

Well that's something -- thanks for this report. I've had 3 different
machines (laptops and servers, with and without optimisation enabled,
clang and gcc, 3 different OSes) grinding away on Justin's test case
for many hours today, without seeing the problem.

> On current REL_11_STABLE branch with PANIC level i see this backtrace for failed parallel process:
>
> #0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1 0x00007f3b36983535 in __GI_abort () at abort.c:79
> #2 0x000055f03ab87a4e in errfinish (dummy=dummy(at)entry=0) at elog.c:555
> #3 0x000055f03ab899e0 in elog_finish (elevel=elevel(at)entry=22, fmt=fmt(at)entry=0x55f03ad86900 "dsa_area could not attach to segment") at elog.c:1376
> #4 0x000055f03abaa1e2 in get_segment_by_index (area=area(at)entry=0x55f03cdd6bf0, index=index(at)entry=7) at dsa.c:1743
> #5 0x000055f03abaa8ab in get_best_segment (area=area(at)entry=0x55f03cdd6bf0, npages=npages(at)entry=8) at dsa.c:1993
> #6 0x000055f03ababdb8 in dsa_allocate_extended (area=0x55f03cdd6bf0, size=size(at)entry=32768, flags=flags(at)entry=0) at dsa.c:701

Ok, this contains some clues I didn't have before. Here we see that a
request for a 32KB chunk of memory led to a traversal of the linked list
of segments in a given bin, and at some point we followed a link to
segment index number 7, which turned out to be bogus. We tried to
attach to the segment whose handle is stored in
area->control->segment_handles[7] and it was not known to dsm.c. It
wasn't DSM_HANDLE_INVALID, or you'd have got a different error
message. That means that it wasn't a segment that had been freed by
destroy_superblock(), or it'd hold DSM_HANDLE_INVALID.
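
To make the three possible outcomes of that lookup concrete, here is a
rough standalone sketch. It is a toy model and not the dsa.c code: every
toy_* name is made up for illustration, the "known" array merely stands
in for whatever dsm.c considers a live segment, and the invalid value of
0 is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins (assumptions); the real types live in dsm.h and dsa.c. */
typedef uint32_t toy_handle;
#define TOY_HANDLE_INVALID ((toy_handle) 0)
#define TOY_MAX_SEGMENTS 16

typedef enum
{
    TOY_ATTACH_OK,          /* handle is set and the segment exists */
    TOY_SLOT_FREED,         /* slot holds the invalid handle */
    TOY_HANDLE_UNKNOWN      /* handle is set but nobody knows the segment */
} toy_attach_result;

/* Hypothetical registry lookup standing in for dsm.c's bookkeeping. */
static int
toy_handle_is_known(const toy_handle *known, size_t nknown, toy_handle h)
{
    for (size_t i = 0; i < nknown; i++)
        if (known[i] == h)
            return 1;
    return 0;
}

/* Classify what would happen when attaching to the segment in a slot. */
static toy_attach_result
toy_classify_slot(const toy_handle *segment_handles, size_t index,
                  const toy_handle *known, size_t nknown)
{
    toy_handle  h;

    assert(index < TOY_MAX_SEGMENTS);
    h = segment_handles[index];

    if (h == TOY_HANDLE_INVALID)
        return TOY_SLOT_FREED;      /* would give the "freed" message */
    if (!toy_handle_is_known(known, nknown, h))
        return TOY_HANDLE_UNKNOWN;  /* the case seen in this backtrace */
    return TOY_ATTACH_OK;
}
```

In this model the reported failure corresponds to TOY_HANDLE_UNKNOWN for
slot 7: the slot holds something other than the invalid handle, but no
live segment matches it.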

Hmm. So perhaps the bin list was corrupted (the segment index was bad
due to some bogus list manipulation logic or memory overrun or...), or
we corrupted our array of handles, or there is some missing locking
somewhere (all bin manipulation and traversal should be protected by
the area lock), or a valid DSM handle was unexpectedly missing (dsm.c
bug, bogus shm_open() EEXIST from the OS).
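
For the first of those theories, the kind of consistency check I have in
mind for one bin's list looks roughly like the sketch below. Again this
is a standalone toy and not code from dsa.c: the chk_* names, the struct
layout and the return codes are all hypothetical, and a real check would
have to run with the area lock held.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of per-segment list links (not dsa.c's layout). */
#define CHK_MAX_SEGMENTS 16
#define CHK_INDEX_NONE ((size_t) -1)
#define CHK_HANDLE_INVALID 0u

typedef struct
{
    unsigned    handle;     /* 0 means the slot is unused or freed */
    size_t      bin;        /* bin this segment believes it is in */
    size_t      prev;       /* previous segment index in the bin list */
    size_t      next;       /* next segment index in the bin list */
} chk_segment;

typedef enum
{
    CHK_OK,
    CHK_INDEX_OUT_OF_RANGE,
    CHK_CYCLE,
    CHK_FREED_SLOT,
    CHK_WRONG_BIN,
    CHK_BAD_PREV_LINK
} chk_result;

/* Walk one bin's list from its head and report the first inconsistency. */
static chk_result
chk_bin_list(const chk_segment *segs, size_t head, size_t bin)
{
    size_t      prev = CHK_INDEX_NONE;
    size_t      steps = 0;

    assert(segs != NULL);
    for (size_t i = head; i != CHK_INDEX_NONE; i = segs[i].next)
    {
        if (i >= CHK_MAX_SEGMENTS)
            return CHK_INDEX_OUT_OF_RANGE;  /* checked before any access */
        if (++steps > CHK_MAX_SEGMENTS)
            return CHK_CYCLE;               /* more hops than slots */
        if (segs[i].handle == CHK_HANDLE_INVALID)
            return CHK_FREED_SLOT;          /* list points at a dead slot */
        if (segs[i].bin != bin)
            return CHK_WRONG_BIN;
        if (segs[i].prev != prev)
            return CHK_BAD_PREV_LINK;
        prev = i;
    }
    return CHK_OK;
}
```

A walk like this can only catch damage that is visible in the links and
the handle array. A handle that looks valid but is unknown to dsm.c, as
in this report, would pass it.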

Can we please see the stderr output of dsa_dump(area), added just
before the PANIC? Can we see the value of "handle" when the error is
raised, and the directory listing for /dev/shm (assuming Linux) after
the crash (maybe you need restart_after_crash = off to prevent
automatic cleanup)?

--
Thomas Munro
http://www.enterprisedb.com
