Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit
Date: 2018-12-16 20:57:33
Message-ID: 20181216205733.d4otwngn5kk3juhk@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2018-12-16 22:33:00 +1100, Thomas Munro wrote:
> On Fri, Dec 14, 2018 at 4:14 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >> Has anyone tried to reproduce this on other platforms?
> >
> > > I recently also hit this locally, but since that's also Debian unstable... Note that removing openssl "fixed" the issue for me.
> >
> > FWIW, I tried to reproduce this on Fedora 28 and RHEL6, without success.
> > It's possible that there's some significant detail of your configuration
> > that I didn't match, but on the whole "bug in Debian unstable" seems
> > like the most probable theory right now.
>
> I was keen to try to bisect this, but I couldn't reproduce it on a
> freshly upgraded Debian unstable VM, with --with-openssl, using "make
> installcheck" under src/test/authentication. I even tried using the
> gold linker as skink does. Maybe I'm using the wrong checker
> options... Andres, can we see your exact valgrind invocation?

Ok, I think I've narrowed this down a bit further. But far from
completely. I don't think you need particularly special options, but
it's easy to miss the error, because it doesn't cause postmaster to exit
with an error.

It only happens when a bgworker is shut down with SIGQUIT (be it
directly, or via postmaster immediate shutdown):

$ valgrind --quiet --error-exitcode=55 --suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp --suppressions=/home/andres/tmp/valgrind-global.supp --trace-children=yes --track-origins=yes --read-var-info=no --num-callers=20 --leak-check=no --gen-suppressions=all /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -D /srv/dev/pgdev-dev
2018-12-16 12:53:26.274 PST [1187] LOG: listening on IPv4 address "127.0.0.1", port 5433

$ kill -QUIT 1187

==1194== Invalid read of size 8
==1194== at 0x4C3B5A5: check_free (dlerror.c:188)
==1194== by 0x4C3BAB1: free_key_mem (dlerror.c:221)
==1194== by 0x4C3BAB1: __dlerror_main_freeres (dlerror.c:239)
==1194== by 0x53D6F81: __libc_freeres (in /lib/x86_64-linux-gnu/libc-2.28.so)
==1194== by 0x482D19E: _vgnU_freeres (vg_preloaded.c:77)
==1194== by 0x567F54: bgworker_quickdie (bgworker.c:662)
==1194== by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194== by 0x5367B76: epoll_wait (epoll_wait.c:30)
==1194== by 0x5EE7CC: WaitEventSetWaitBlock (latch.c:1078)
==1194== by 0x5EE6A5: WaitEventSetWait (latch.c:1030)
==1194== by 0x5EDDBC: WaitLatchOrSocket (latch.c:407)
==1194== by 0x5EDC23: WaitLatch (latch.c:347)
==1194== by 0x5992D7: ApplyLauncherMain (launcher.c:1062)
==1194== by 0x568245: StartBackgroundWorker (bgworker.c:835)
==1194== by 0x57C295: do_start_bgworker (postmaster.c:5742)
==1194== by 0x57C631: maybe_start_bgworkers (postmaster.c:5955)
==1194== by 0x578C3C: reaper (postmaster.c:2940)
==1194== by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194== by 0x535F3B6: select (select.c:41)
==1194== by 0x576A9F: ServerLoop (postmaster.c:1677)
==1194== by 0x57642A: PostmasterMain (postmaster.c:1386)
==1194== Address 0x708d488 is 12 bytes after a block of size 12 alloc'd
==1194== at 0x483577F: malloc (vg_replace_malloc.c:299)
==1194== by 0x4AD8D38: CRYPTO_zalloc (mem.c:230)
==1194== by 0x4AD4F8D: ossl_init_get_thread_local (init.c:66)
==1194== by 0x4AD4F8D: ossl_init_get_thread_local (init.c:59)
==1194== by 0x4AD4F8D: ossl_init_thread_start (init.c:426)
==1194== by 0x4AFE5B9: RAND_DRBG_get0_public (drbg_lib.c:1118)
==1194== by 0x4AFE5EF: drbg_bytes (drbg_lib.c:963)
==1194== by 0x7F6DD9: pg_strong_random (pg_strong_random.c:135)
==1194== by 0x57B70F: RandomCancelKey (postmaster.c:5251)
==1194== by 0x57C367: assign_backendlist_entry (postmaster.c:5822)
==1194== by 0x57C0F2: do_start_bgworker (postmaster.c:5692)
==1194== by 0x57C631: maybe_start_bgworkers (postmaster.c:5955)
==1194== by 0x578C3C: reaper (postmaster.c:2940)
==1194== by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194== by 0x535F3B6: select (select.c:41)
==1194== by 0x576A9F: ServerLoop (postmaster.c:1677)
==1194== by 0x57642A: PostmasterMain (postmaster.c:1386)
==1194== by 0x4997E0: main (main.c:228)

I now suspect this is a more long-running issue than I thought. Not all
my valgrind buildfarm branches have ssl enabled (due to an ssl issue a
while back). And previously this wouldn't have been caught, because it
doesn't cause postmaster to fail; it's just that Andrew added a script
that checks logs for valgrind bleats.

The interesting bit is that if I replace the _exit(2) in
bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
or manually add an OPENSSL_cleanup() before the _exit(2), valgrind
doesn't find errors.

The fact that one needs an immediate shutdown in a bgworker, with
openssl enabled, explains why this is hard to hit...

Greetings,

Andres Freund
