Re: pg11.1: dsa_area could not attach to segment

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Sergei Kornilov <sk(at)zsrv(dot)org>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg11.1: dsa_area could not attach to segment
Date: 2019-02-12 02:14:28
Message-ID: 20190212021428.GA31721@telsasoft.com
Lists: pgsql-hackers

On Tue, Feb 12, 2019 at 10:57:51AM +1100, Thomas Munro wrote:
> > On the current REL_11_STABLE branch with PANIC level I see this backtrace for the failed parallel process:
> >
> > #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > #1 0x00007f3b36983535 in __GI_abort () at abort.c:79
> > #2 0x000055f03ab87a4e in errfinish (dummy=dummy@entry=0) at elog.c:555
> > #3 0x000055f03ab899e0 in elog_finish (elevel=elevel@entry=22, fmt=fmt@entry=0x55f03ad86900 "dsa_area could not attach to segment") at elog.c:1376
> > #4 0x000055f03abaa1e2 in get_segment_by_index (area=area@entry=0x55f03cdd6bf0, index=index@entry=7) at dsa.c:1743
> > #5 0x000055f03abaa8ab in get_best_segment (area=area@entry=0x55f03cdd6bf0, npages=npages@entry=8) at dsa.c:1993
> > #6 0x000055f03ababdb8 in dsa_allocate_extended (area=0x55f03cdd6bf0, size=size@entry=32768, flags=flags@entry=0) at dsa.c:701
>
> Ok, this contains some clues I didn't have before. Here we see that a
> request for a 32KB chunk of memory led to a traversal of the linked list
> of segments in a given bin, and at some point we followed a link to
> segment index number 7, which turned out to be bogus. We tried to
> attach to the segment whose handle is stored in
> area->control->segment_handles[7] and it was not known to dsm.c. It
> wasn't DSM_HANDLE_INVALID, or you'd have got a different error
> message. That means that it wasn't a segment that had been freed by
> destroy_superblock(), or it'd hold DSM_HANDLE_INVALID.
>
> Hmm. So perhaps the bin list was corrupted (the segment index was bad

I think there is corruption *somewhere*, because I have never been able to
do the following (and the pointer value looks very broken):

(gdb) p segment_map
$1 = (dsa_segment_map *) 0x1

(gdb) print segment_map->header
Cannot access memory at address 0x11
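
Note on the two addresses above: 0x11 is what one would expect if gdb simply added the offset of the header field to the bogus pointer value 0x1. The sketch below illustrates that arithmetic. It assumes the field order of dsa_segment_map in dsa.c (reproduced from memory, so treat it as an assumption) and a build where pointers are 8 bytes; model_segment_map and header_field_address are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in for dsa_segment_map. The field order is an assumption taken
 * from memory of dsa.c; the pointed-to types are replaced by void so the
 * sketch compiles on its own.
 */
typedef struct
{
    void *segment;          /* dsm_segment * in the real struct */
    void *mapped_address;   /* char * in the real struct */
    void *header;           /* dsa_segment_header * in the real struct */
    void *fpm;              /* FreePageManager * in the real struct */
    void *pagemap;          /* dsa_pointer * in the real struct */
} model_segment_map;

/* Address that must be read to evaluate map->header for a pointer value. */
static uintptr_t
header_field_address(uintptr_t map_pointer)
{
    /* Guard against wrap-around for absurdly large pointer values. */
    assert(map_pointer <= UINTPTR_MAX - offsetof(model_segment_map, header));
    return map_pointer + offsetof(model_segment_map, header);
}
```

With 8-byte pointers the header field sits at offset 0x10, so header_field_address(0x1) yields 0x11, matching the gdb message. This only explains the address in the error; it says nothing about how segment_map came to hold 0x1.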

> Can we please see the stderr output of dsa_dump(area), added just
> before the PANIC? Can we see the value of "handle" when the error is
> raised, and the directory listing for /dev/shm (assuming Linux) after
> the crash (maybe you need restart_after_crash = off to prevent
> automatic cleanup)?

PANIC: dsa_area could not attach to segment index:8 handle:1076305344

I think it needs to be:

| if (segment == NULL) {
|     LWLockRelease(DSA_AREA_LOCK(area));
|     dsa_dump(area);
|     elog(PANIC, "dsa_area could not attach to segment index:%zd handle:%d", index, handle);
| }

..but that triggers recursion, because dsa_dump() itself calls
get_segment_by_index() and hits the same failure:

#0 0x00000037b9c32495 in raise () from /lib64/libc.so.6
#1 0x00000037b9c33c75 in abort () from /lib64/libc.so.6
#2 0x0000000000a395c0 in errfinish (dummy=0) at elog.c:567
#3 0x0000000000a3bbf6 in elog_finish (elevel=22, fmt=0xc9faa0 "dsa_area could not attach to segment index:%zd handle:%d") at elog.c:1389
#4 0x0000000000a6b97a in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1747
#5 0x0000000000a6a3dc in dsa_dump (area=0x1659200) at dsa.c:1093
#6 0x0000000000a6b946 in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1744
[...]
#717 0x0000000000a6a3dc in dsa_dump (area=0x1659200) at dsa.c:1093
#718 0x0000000000a6b946 in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1744
#719 0x0000000000a6a3dc in dsa_dump (area=0x1659200) at dsa.c:1093
#720 0x0000000000a6b946 in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1744
#721 0x0000000000a6c150 in get_best_segment (area=0x1659200, npages=8) at dsa.c:1997
#722 0x0000000000a69680 in dsa_allocate_extended (area=0x1659200, size=32768, flags=0) at dsa.c:701
#723 0x00000000007052eb in ExecParallelHashTupleAlloc (hashtable=0x7f56ff9b40e8, size=112, shared=0x7fffda8c36a0) at nodeHash.c:2837
#724 0x00000000007034f3 in ExecParallelHashTableInsert (hashtable=0x7f56ff9b40e8, slot=0x1608948, hashvalue=2677813320) at nodeHash.c:1693
#725 0x0000000000700ba3 in MultiExecParallelHash (node=0x1607f40) at nodeHash.c:288
#726 0x00000000007007ce in MultiExecHash (node=0x1607f40) at nodeHash.c:112
#727 0x00000000006e94d7 in MultiExecProcNode (node=0x1607f40) at execProcnode.c:501
[...]
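
One way to avoid the recursion shown in this backtrace would be a re-entrancy guard around the dump. The following is a toy model only, not PostgreSQL code: model_area, model_get_segment and model_dump are hypothetical names, the "attachable" array stands in for whether dsm.c knows the handle, and a counter stands in for elog(PANIC, ...).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NSEGMENTS 16

/* Toy stand-in for a dsa_area; every field here is hypothetical. */
typedef struct
{
    bool attachable[MODEL_NSEGMENTS]; /* false models an unknown handle */
    bool dumping;                     /* true while model_dump() runs */
    int  dump_calls;                  /* how many times the dump started */
    int  fatal_reports;               /* stands in for elog(PANIC, ...) */
} model_area;

static void model_dump(model_area *area);

/* Returns true if the segment at 'index' could be "attached". */
static bool
model_get_segment(model_area *area, size_t index)
{
    assert(index < MODEL_NSEGMENTS);
    if (area->attachable[index])
        return true;

    /* Dump only from the outermost failure; nested failures just return. */
    if (!area->dumping)
    {
        area->dumping = true;
        model_dump(area);
        area->dumping = false;
        area->fatal_reports++;    /* where the PANIC would be raised */
    }
    return false;
}

/* As in the backtrace, the dump looks segments up via the same accessor. */
static void
model_dump(model_area *area)
{
    area->dump_calls++;
    for (size_t i = 0; i < MODEL_NSEGMENTS; i++)
        (void) model_get_segment(area, i);
}
```

In this model, a lookup of a bad index starts the dump once, the nested lookup of the same index returns without dumping again, and the failure is reported a single time. Whether a flag like this is acceptable in dsa.c, even for throwaway debugging, is a separate question.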

[pryzbyj@telsasoft-db postgresql]$ ls -lt /dev/shm |head
total 353056
-rw-------. 1 pryzbyj pryzbyj 1048576 Feb 11 13:51 PostgreSQL.821164732
-rw-------. 1 pryzbyj pryzbyj 2097152 Feb 11 13:51 PostgreSQL.1990121974
-rw-------. 1 pryzbyj pryzbyj 2097152 Feb 11 12:54 PostgreSQL.847060172
-rw-------. 1 pryzbyj pryzbyj 2097152 Feb 11 12:48 PostgreSQL.1369859581
-rw-------. 1 postgres postgres 21328 Feb 10 21:00 PostgreSQL.1155375187
-rw-------. 1 pryzbyj pryzbyj 196864 Feb 10 18:52 PostgreSQL.2136009186
-rw-------. 1 pryzbyj pryzbyj 2097152 Feb 10 18:49 PostgreSQL.1648026537
-rw-------. 1 pryzbyj pryzbyj 2097152 Feb 10 18:49 PostgreSQL.827867206
-rw-------. 1 pryzbyj pryzbyj 2097152 Feb 10 18:49 PostgreSQL.1684837530
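
To compare the handle from the PANIC message against a listing like this one, the handle can be formatted the same way the file names appear, as "PostgreSQL." followed by the unsigned decimal handle. The sketch below does only that string comparison; model_dsm_name and model_name_in_listing are hypothetical helper names, and the naming pattern is taken from the listing itself rather than from any documented interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a handle the way the files in the listing are named. */
static int
model_dsm_name(uint32_t handle, char *buf, size_t buflen)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, buflen, "PostgreSQL.%u", (unsigned) handle);
    /* Report failure if the name was truncated or could not be formatted. */
    return (n > 0 && (size_t) n < buflen) ? 0 : -1;
}

/* Returns 1 if the handle's name appears among 'names', otherwise 0. */
static int
model_name_in_listing(uint32_t handle, const char *const *names, size_t count)
{
    char wanted[64];

    if (model_dsm_name(handle, wanted, sizeof(wanted)) != 0)
        return 0;
    for (size_t i = 0; i < count; i++)
    {
        if (strcmp(names[i], wanted) == 0)
            return 1;
    }
    return 0;
}
```

Applied to the nine names shown here, handle 1076305344 has no match. Since the listing went through head, that only covers the most recently modified entries and does not show that the file is absent from /dev/shm.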

Justin
