Re: logical replication launcher crash on buildfarm

From: Andres Freund <andres(at)anarazel(dot)de>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject: Re: logical replication launcher crash on buildfarm
Date: 2017-03-27 16:50:23
Message-ID: 20170327165023.t224nralrspk6udc@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-03-16 10:13:37 +0100, Petr Jelinek wrote:
> On 16/03/17 09:53, Andres Freund wrote:
> > On 2017-03-16 09:40:48 +0100, Petr Jelinek wrote:
> >> On 16/03/17 04:42, Andres Freund wrote:
> >>> On 2017-03-15 20:28:33 -0700, Andres Freund wrote:
> >>>> Hi,
> >>>>
> >>>> I just unstuck a bunch of my buildfarm animals. That triggered some
> >>>> spurious failures (on piculet, calliphoridae, mylodon), but also one
> >>>> that doesn't really look like that:
> >>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A40%3A03
> >>>>
> >>>> with the pertinent point being:
> >>>>
> >>>> ================== stack trace: pgsql.build/src/test/regress/tmp_check/data/core ==================
> >>>> [New LWP 1894]
> >>>> [Thread debugging using libthread_db enabled]
> >>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> >>>> Core was generated by `postgres: bgworker: logical replication launcher '.
> >>>> Program terminated with signal SIGSEGV, Segmentation fault.
> >>>> #0 0x000055e265bff5e3 in ?? ()
> >>>> #0 0x000055e265bff5e3 in ?? ()
> >>>> #1 0x000055d3ccabed0d in StartBackgroundWorker () at /home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c:792
> >>>> #2 0x000055d3ccacf4fc in SubPostmasterMain (argc=3, argv=0x55d3cdbb71c0) at /home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/postmaster/postmaster.c:4878
> >>>> #3 0x000055d3cca443ea in main (argc=3, argv=0x55d3cdbb71c0) at /home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/main/main.c:205
> >>>>
> >>>> it's possible that me killing things and upgrading caused this, but
> >>>> given this is a backend running EXEC_BACKEND, I'm a bit suspicious that
> >>>> it's more than that. The machine is a bit backed up at the moment, so
> >>>> it'll probably be a while till it's at that animal/branch again,
> >>>> otherwise I'd not have mentioned this.
> >>>
> >>> For some reason it ran again pretty soon. And I'm afraid it's indeed an
> >>> issue:
> >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2003%3A30%3A02
> >>>
> >>
> >> Hmm, I tried with EXEC_BACKEND (and with --disable-spinlocks) and it
> >> seems to work fine on my two machines. I don't see anything else
> >> different on culicidae though. Sadly the backtrace is not that
> >> informative either. I'll try to investigate more but it will take time...
> >
> > Worthwhile additional failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2017-03-16%2002%3A55%3A01
> >
> > Same animal, also EXEC_BACKEND, but 9.6.
> >
> > A quick look at the relevant line:
> > 	/*
> > 	 * If bgw_main is set, we use that value as the initial entrypoint.
> > 	 * However, if the library containing the entrypoint wasn't loaded at
> > 	 * postmaster startup time, passing it as a direct function pointer is not
> > 	 * possible. To work around that, we allow callers for whom a function
> > 	 * pointer is not available to pass a library name (which will be loaded,
> > 	 * if necessary) and a function name (which will be looked up in the named
> > 	 * library).
> > 	 */
> > 	if (worker->bgw_main != NULL)
> > 		entrypt = worker->bgw_main;
> >
> > makes the issue clear - we appear to be assuming that bgw_main is
> > meaningful across processes. Which it isn't in the EXEC_BACKEND case
> > when ASLR is in use...
> >
> > This kinda sounds familiar, but a quick google search doesn't find
> > anything relevant.

Robert, Petr, is either of you planning to fix this (as outlined elsewhere
in the thread)?

> Hmm now that you mention it, I remember discussing something similar
> with you last year in Dallas in regards to parallel query. IIRC Windows
> should not have this problem but other systems with EXEC_BACKEND do.
> Don't remember the details though.

I don't think that's reliable; it only works as long as the binary is
compiled without position-independent code.

Greetings,

Andres Freund
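
[Editorial sketch, not part of the original message.] The problem described
above is that a raw function pointer such as bgw_main is only valid in the
process that produced it: under EXEC_BACKEND each child is a freshly exec'd
binary, and with ASLR its code may be mapped at a different address. One
common way around this is to pass a name across the process boundary and
resolve it to an address inside the child. The C sketch below illustrates
that idea only. Every identifier in it (WorkerSpec, internal_workers,
lookup_worker_entry, start_worker, launcher_main, apply_main) is hypothetical
and chosen for illustration; it is not the PostgreSQL API and not necessarily
the fix that was adopted.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical entry point signature for a worker. */
typedef void (*worker_main_fn)(int arg);

/* Records the last argument seen, so the effect of a call is observable. */
int last_arg_seen = -1;

/* Two hypothetical worker entry points. */
void launcher_main(int arg)
{
    last_arg_seen = arg;
    printf("launcher started with arg %d\n", arg);
}

void apply_main(int arg)
{
    last_arg_seen = arg;
    printf("apply worker started with arg %d\n", arg);
}

/*
 * What gets handed to the child process: a function NAME plus an argument.
 * A name is position independent, unlike a function pointer.
 */
typedef struct WorkerSpec
{
    char function_name[64];
    int  arg;
} WorkerSpec;

/* Name-to-function table, rebuilt identically in every process image. */
static const struct
{
    const char    *name;
    worker_main_fn fn;
} internal_workers[] = {
    {"launcher_main", launcher_main},
    {"apply_main", apply_main},
};

/* Resolve a name to an address valid in the CURRENT process, or NULL. */
worker_main_fn lookup_worker_entry(const char *name)
{
    size_t n = sizeof(internal_workers) / sizeof(internal_workers[0]);

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(internal_workers[i].name, name) == 0)
            return internal_workers[i].fn;
    }
    return NULL;
}

/* Child-side startup: look the entry point up by name, then call it. */
int start_worker(const WorkerSpec *spec)
{
    worker_main_fn fn = lookup_worker_entry(spec->function_name);

    if (fn == NULL)
        return -1;          /* unknown name: refuse rather than jump blindly */
    fn(spec->arg);
    return 0;
}
```

In this sketch the parent would fill in a WorkerSpec with the string
"launcher_main" and the child would call start_worker() on it. An unknown
name is reported as an error instead of causing a jump to a stale address,
which is the failure mode visible in the stack trace above.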
