Re: SIGSEGV in BRIN autosummarize

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: SIGSEGV in BRIN autosummarize
Date: 2017-10-15 01:56:56
Message-ID: 20171015015656.GC22678@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote:
> > Also notice the vacuum process was interrupted, same as yesterday (think
> > goodness for full logs). Our INSERT script is using python
> > multiprocessing.pool() with "maxtasksperchild=1", which I think means we load
> > one file and then exit the subprocess, and pool() creates a new subproc, which
> > starts a new PG session and transaction. Which explains why autovacuum starts
> > processing the table only to be immediately interrupted.

On Sun, Oct 15, 2017 at 01:57:14AM +0200, Tomas Vondra wrote:
> I don't follow. Why does it explain that autovacuum gets canceled? I
> mean, merely opening a new connection/session should not cancel
> autovacuum. That requires a command that requires table-level lock
> conflicting with autovacuum (so e.g. explicit LOCK command, DDL, ...).

I was thinking that INSERT would do it, but I gather you're right about
autovacuum. Let me get back to you about this..

> > Due to a .."behavioral deficiency" in the loader for those tables, the crashed
> > backend causes the loader to get stuck, so the tables should be untouched since
> > the crash, should it be desirable to inspect them.
> >
>
> It's a bit difficult to guess what went wrong from this backtrace. For
> me gdb typically prints a bunch of lines immediately before the frames,
> explaining what went wrong - not sure why it's missing here.

Do you mean this ?

...
Loaded symbols for /lib64/libnss_files-2.12.so
Core was generated by `postgres: autovacuum worker process gtt '.
Program terminated with signal 11, Segmentation fault.
#0 pfree (pointer=0x298c740) at mcxt.c:954
954 (*context->methods->free_p) (context, pointer);

> Perhaps some of those pointers are bogus, the memory was already pfree-d
> or something like that. You'll have to poke around and try dereferencing
> the pointers to find what works and what does not.
>
> For example what do these gdb commands do in the #0 frame?
>
> (gdb) p *(MemoryContext)context

(gdb) p *(MemoryContext)context
Cannot access memory at address 0x7474617261763a20

> (gdb) p *GetMemoryChunkContext(pointer)

(gdb) p *GetMemoryChunkContext(pointer)
No symbol "GetMemoryChunkContext" in current context.

I had to do this since it's apparently inlined/macro:
(gdb) p *(MemoryContext *) (((char *) pointer) - sizeof(void *))
$8 = (MemoryContext) 0x7474617261763a20

I uploaded the corefile:
http://telsasoft.com/tmp/coredump-postgres-autovacuum-brin-summarize.gz

Justin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2017-10-15 10:06:11 Re: proposal - Default namespaces for XPath expressions (PostgreSQL 11)
Previous Message Joe Conway 2017-10-15 01:51:39 Re: pg_regress help output