Re: What is happening on buildfarm member crake?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: What is happening on buildfarm member crake?
Date: 2014-01-20 01:22:42
Message-ID: CA+TgmoZBXXVMF-oH=JEv5BsshE7P8PTz25-1qjVWeMh8LbhRHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jan 19, 2014 at 7:53 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> Also crake does produce backtraces on core dumps, and they are at the
> bottom of the buildfarm log. The latest failure backtrace is reproduced
> below.
>
> ================== stack trace:
> /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==================
> [New LWP 12584]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `postgres: buildfarm
> contrib_regression_test_shm_mq'.
> Program terminated with signal 11, Segmentation fault.
> #0 SetLatch (latch=0x1c) at pg_latch.c:509
> 509 if (latch->is_set)
> #0 SetLatch (latch=0x1c) at pg_latch.c:509
> #1 0x000000000064c04e in procsignal_sigusr1_handler
> (postgres_signal_arg=<optimized out>) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
> #2 <signal handler called>
> #3 _dl_fini () at dl-fini.c:190
> #4 0x000000361ba39931 in __run_exit_handlers (status=0,
> listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
> #5 0x000000361ba399b5 in __GI_exit (status=<optimized out>) at
> exit.c:100
> #6 0x00000000006485a6 in proc_exit (code=0) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
> #7 0x0000000000663abb in PostgresMain (argc=<optimized out>,
> argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq",
> username=<optimized out>) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
> #8 0x000000000062220f in BackendRun (port=0x12d6bf0) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
> #9 BackendStartup (port=0x12d6bf0) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
> #10 ServerLoop () at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
> #11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
> #12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

Hmm, that looks an awful lot like the SIGUSR1 signal handler is
getting called after we've already completed shmem_exit. And indeed
that seems like the sort of thing that would result in dying horribly
in just this way. The obvious fix seems to be to check
proc_exit_inprogress before doing anything that might touch shared
memory, but there are a lot of other SIGUSR1 handlers that don't do
that either. However, in those cases, the likely cause of a SIGUSR1
would be a sinval catchup interrupt or a recovery conflict, which
aren't likely to be so far delayed that they arrive after we've
already disconnected from shared memory. But the dynamic background
workers stuff adds a new possible cause of SIGUSR1: the postmaster
letting us know that a child has started or died. And that could
happen even after we've detached shared memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-01-20 01:25:28 Re: plpgsql.warn_shadow
Previous Message Andrew Dunstan 2014-01-20 00:53:19 Re: What is happening on buildfarm member crake?