Re: pg11.1: dsa_area could not attach to segment

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Sergei Kornilov <sk(at)zsrv(dot)org>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg11.1: dsa_area could not attach to segment
Date: 2019-02-12 04:09:42
Message-ID: CAEepm=34-Kq+fLi+0FTmS-nBdhC2d-z1i7Pe2ZDD4MbEWQeE9g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 12, 2019 at 1:14 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> On Tue, Feb 12, 2019 at 10:57:51AM +1100, Thomas Munro wrote:
> > > On the current REL_11_STABLE branch with PANIC level I see this backtrace for the failed parallel process:
> > >
> > > #0 __GI_raise (sig=sig(at)entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > > #1 0x00007f3b36983535 in __GI_abort () at abort.c:79
> > > #2 0x000055f03ab87a4e in errfinish (dummy=dummy(at)entry=0) at elog.c:555
> > > #3 0x000055f03ab899e0 in elog_finish (elevel=elevel(at)entry=22, fmt=fmt(at)entry=0x55f03ad86900 "dsa_area could not attach to segment") at elog.c:1376
> > > #4 0x000055f03abaa1e2 in get_segment_by_index (area=area(at)entry=0x55f03cdd6bf0, index=index(at)entry=7) at dsa.c:1743
> > > #5 0x000055f03abaa8ab in get_best_segment (area=area(at)entry=0x55f03cdd6bf0, npages=npages(at)entry=8) at dsa.c:1993
> > > #6 0x000055f03ababdb8 in dsa_allocate_extended (area=0x55f03cdd6bf0, size=size(at)entry=32768, flags=flags(at)entry=0) at dsa.c:701
> >
> > Ok, this contains some clues I didn't have before. Here we see that a
> > request for a 32KB chunk of memory led to a traversal of the linked list
> > of segments in a given bin, and at some point we followed a link to
> > segment index number 7, which turned out to be bogus. We tried to
> > attach to the segment whose handle is stored in
> > area->control->segment_handles[7] and it was not known to dsm.c. It
> > wasn't DSM_HANDLE_INVALID, or you'd have got a different error
> > message. That means that it wasn't a segment that had been freed by
> > destroy_superblock(), or it'd hold DSM_HANDLE_INVALID.
> >
> > Hmm. So perhaps the bin list was corrupted (the segment index was bad
>
> I think there is corruption *somewhere*, since I've never been able to do
> this (and it looks very broken?)
>
> (gdb) p segment_map
> $1 = (dsa_segment_map *) 0x1
>
> (gdb) print segment_map->header
> Cannot access memory at address 0x11

If you're in get_segment_by_index() in a core dumped at the "could not
attach" error, the variable segment_map hasn't been assigned a value
yet, so that's uninitialised junk on the stack.
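
For reference, that code path looks roughly like this (paraphrased from
memory rather than copied from dsa.c, so details and line numbers may be
slightly off):

    static dsa_segment_map *
    get_segment_by_index(dsa_area *area, dsa_segment_index index)
    {
        if (unlikely(area->segment_maps[index].mapped_address == NULL))
        {
            dsm_handle       handle;
            dsm_segment     *segment;
            dsa_segment_map *segment_map;

            /* Handle published by whichever backend created the segment. */
            handle = area->control->segment_handles[index];

            /* A freed slot holds DSM_HANDLE_INVALID and raises the other
             * error message. */
            if (handle == DSM_HANDLE_INVALID)
                elog(ERROR, "...");

            /* This is the call that failed: dsm.c didn't know the handle. */
            segment = dsm_attach(handle);
            if (segment == NULL)
                elog(ERROR, "dsa_area could not attach to segment");

            /* segment_map is only assigned after that check, hence the junk
             * pointer in the core. */
            segment_map = &area->segment_maps[index];
            segment_map->segment = segment;
            segment_map->mapped_address = dsm_segment_address(segment);
            /* ... header, fpm and pagemap pointers are set up here ... */
        }
        return &area->segment_maps[index];
    }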

> > Can we please see the stderr output of dsa_dump(area), added just
> > before the PANIC? Can we see the value of "handle" when the error is
> > raised, and the directory listing for /dev/shm (assuming Linux) after
> > the crash (maybe you need restart_after_crash = off to prevent
> > automatic cleanup)?
>
> PANIC: dsa_area could not attach to segment index:8 handle:1076305344
>
> I think it needs to be:
>
> | if (segment == NULL)
> | {
> |     LWLockRelease(DSA_AREA_LOCK(area));
> |     dsa_dump(area);
> |     elog(PANIC, "dsa_area could not attach to segment index:%zd handle:%d",
> |          index, handle);
> | }

Right.

> ..but that triggers recursion:
>
> #0 0x00000037b9c32495 in raise () from /lib64/libc.so.6
> #1 0x00000037b9c33c75 in abort () from /lib64/libc.so.6
> #2 0x0000000000a395c0 in errfinish (dummy=0) at elog.c:567
> #3 0x0000000000a3bbf6 in elog_finish (elevel=22, fmt=0xc9faa0 "dsa_area could not attach to segment index:%zd handle:%d") at elog.c:1389
> #4 0x0000000000a6b97a in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1747
> #5 0x0000000000a6a3dc in dsa_dump (area=0x1659200) at dsa.c:1093
> #6 0x0000000000a6b946 in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1744

Ok, that makes sense.
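
The reason: dsa_dump() walks every segment bin and calls
get_segment_by_index() for each segment the backend hasn't mapped yet, so
calling it from get_segment_by_index()'s own failure path lands straight
back on the same bad index. Roughly (again paraphrased, not verbatim):

    for (i = 0; i < DSA_NUM_SEGMENT_BINS; ++i)
    {
        dsa_segment_index segment_index = area->control->segment_bins[i];

        while (segment_index != DSA_SEGMENT_INDEX_NONE)
        {
            /* Re-enters the function we were trying to report an error from. */
            dsa_segment_map *segment_map =
                get_segment_by_index(area, segment_index);

            /* ... fprintf(stderr, ...) details of this segment ... */
            segment_index = segment_map->header->next;
        }
    }

A plain fprintf() loop over area->control->segment_handles[] would avoid
the re-entry, if we still want that information.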

> [pryzbyj(at)telsasoft-db postgresql]$ ls -lt /dev/shm |head
> total 353056
> -rw-------. 1 pryzbyj pryzbyj 1048576 Feb 11 13:51 PostgreSQL.821164732
> -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 11 13:51 PostgreSQL.1990121974
> -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 11 12:54 PostgreSQL.847060172
> -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 11 12:48 PostgreSQL.1369859581
> -rw-------. 1 postgres postgres 21328 Feb 10 21:00 PostgreSQL.1155375187
> -rw-------. 1 pryzbyj pryzbyj 196864 Feb 10 18:52 PostgreSQL.2136009186
> -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 10 18:49 PostgreSQL.1648026537
> -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 10 18:49 PostgreSQL.827867206
> -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 10 18:49 PostgreSQL.1684837530

Hmm. While contemplating this evidence that you have multiple
postgres clusters running, I started wondering if there could be some
way for two different DSA areas to get their DSM segments mixed up,
perhaps related to the way that 11 generates identical sequences of
random() numbers in all parallel workers. But I'm not seeing it; the
O_CREAT | O_EXCL protocol seems to avoid that (not to mention
permissions).
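
For the record, the reason I don't think a cross-cluster or cross-area
mix-up is possible: dsm_create() keeps picking random handles until
creation succeeds, and on the POSIX implementation the segment file name
is derived from the handle and created with O_CREAT | O_EXCL, so a handle
that already exists is simply rejected and a new one chosen. From memory
(not verbatim dsm.c / dsm_impl.c):

    /* dsm_create(): loop until an unused handle is found. */
    for (;;)
    {
        seg->handle = random();
        if (seg->handle == DSM_HANDLE_INVALID)      /* reserved sentinel */
            continue;
        if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size, &seg->impl_private,
                        &seg->mapped_address, &seg->mapped_size, ERROR))
            break;                      /* created it, the handle is ours */
        /* otherwise the name already existed; pick another handle */
    }

    /* dsm_impl_posix(): the file under /dev/shm is named after the handle. */
    snprintf(name, 64, "/PostgreSQL.%u", handle);
    flags = O_RDWR | (op == DSM_OP_CREATE ? O_CREAT | O_EXCL : 0);
    fd = shm_open(name, flags, 0600);

That last part is also why the handle value maps directly onto a
PostgreSQL.<handle> file name in /dev/shm.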

You truncated that list at 10 lines with "head"... do you know if
PostgreSQL.1076305344 was present? Or, if you do it again, could you check
for the one whose name matches the value of "handle"?

--
Thomas Munro
http://www.enterprisedb.com
