Re: intermittent failures in Cygwin from select_parallel tests

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: intermittent failures in Cygwin from select_parallel tests
Date: 2017-08-03 03:47:40
Message-ID: 20170803034740.GA2641942@rfd.leadboat.com
Lists: pgsql-hackers

On Wed, Jun 21, 2017 at 06:44:09PM -0400, Tom Lane wrote:
> Today, lorikeet failed with a new variant on the bgworker start crash:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-21%2020%3A29%3A10
>
> This one is even more exciting than the last one, because it sure looks
> like the crashing bgworker took the postmaster down with it. That is
> Not Supposed To Happen.
>
> Wondering if we broke something here recently, I tried to reproduce it
> on a Linux machine by adding a randomized Assert failure in
> shm_mq_set_sender. I don't see any such problem, even with EXEC_BACKEND;
> we recover from the crash as-expected.
>
> So I'm starting to get a distinct feeling that there's something wrong
> with the cygwin port. But I dunno what.

I think signal blocking broke on Cygwin.

On a system (gcc 5.4.0, CYGWIN_NT-10.0 2.7.0(0.306/5/3) 2017-02-12 13:18
x86_64) that reproduces lorikeet's symptoms, I instrumented the postmaster as
attached. The patch's small_parallel.sql is a subset of select_parallel.sql
sufficient to reproduce the mq_sender Assert failure and the postmaster silent
exit. (It occasionally needed hundreds of iterations to do so.) The parallel
query normally starts four bgworkers; when the mq_sender Assert fired, the
test had started five workers in response to four registrations.

The postmaster.c instrumentation regularly detects sigusr1_handler() calls
while another sigusr1_handler() is already on the stack:

6328 2017-08-02 07:25:42.788 GMT LOG: forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.788 GMT DEBUG: saw slot-0 registration, want 0
6328 2017-08-02 07:25:42.788 GMT DEBUG: saw slot-0 registration, want 1
6328 2017-08-02 07:25:42.788 GMT DEBUG: slot 1 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG: registering background worker "parallel worker for PID 4776" (slot 1)
6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-1 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-0 registration, want 2
6328 2017-08-02 07:25:42.789 GMT DEBUG: slot 2 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG: registering background worker "parallel worker for PID 4776" (slot 2)
6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-2 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-1 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-0 registration, want 3
6328 2017-08-02 07:25:42.789 GMT DEBUG: slot 3 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG: registering background worker "parallel worker for PID 4776" (slot 3)
6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-3 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-2 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-1 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG: saw slot-0 registration, want 4
6328 2017-08-02 07:25:42.789 GMT DEBUG: slot 4 not yet registered
6328 2017-08-02 07:25:42.789 GMT DEBUG: registering background worker "parallel worker for PID 4776" (slot 4)
6328 2017-08-02 07:25:42.789 GMT DEBUG: starting background worker process "parallel worker for PID 4776"
6328 2017-08-02 07:25:42.790 GMT LOG: forbid signals @ sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT WARNING: signals already forbidden @ sigusr1_handler
6328 2017-08-02 07:25:42.790 GMT LOG: permit signals @ sigusr1_handler

Postmaster algorithms rely on the PG_SETMASK() calls to prevent such nested
handler invocations. Without that protection, duplicate bgworkers are an
understandable result. I caught several other assertions; the PMChildFlags
failure is another case of duplicate postmaster children:

6 TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 871)
3 TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: "pmsignal.c", Line: 229)
20 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2523)
21 TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: "shm_mq.c", Line: 221)

I also got a few "select() failed in postmaster: Bad address" messages.

I suspect a Cygwin signals bug. I'll try to distill a self-contained test
case for the Cygwin hackers. The lack of failures on buildfarm member brolga
argues that older Cygwin is not affected.
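
For illustration only, here is a minimal sketch of the invariant described
above: a handler that blocks signals on entry should never be entered again
while an earlier invocation is still on the stack. This is not the attached
patch and not the test case promised above; the names usr1_handler,
handler_depth, nested_entries and count_nested_deliveries are hypothetical,
and plain sigprocmask() stands in for PG_SETMASK(). Whether something this
small reproduces the Cygwin behavior is an open question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical instrumentation state, loosely analogous to the patch's checks. */
static volatile sig_atomic_t handler_depth = 0;
static volatile sig_atomic_t nested_entries = 0;
static volatile sig_atomic_t handled = 0;

static void
usr1_handler(int signo)
{
    sigset_t    block_all;
    sigset_t    saved;

    (void) signo;

    /* "forbid signals": block everything for the duration of the handler */
    sigfillset(&block_all);
    sigprocmask(SIG_SETMASK, &block_all, &saved);

    if (handler_depth > 0)
        nested_entries++;       /* "signals already forbidden" */
    handler_depth++;
    handled++;

    /*
     * On every odd invocation, raise the signal again while it is blocked.
     * A conforming system leaves it pending until this handler has returned.
     */
    if (handled % 2 == 1)
        raise(SIGUSR1);

    handler_depth--;

    /* "permit signals": restore the mask that was in effect on entry */
    sigprocmask(SIG_SETMASK, &saved, NULL);
}

/* Returns how many times the handler was entered while already running. */
int
count_nested_deliveries(int rounds)
{
    struct sigaction sa;
    struct sigaction old;
    int         rc;
    int         i;

    handler_depth = 0;
    nested_entries = 0;
    handled = 0;

    sa.sa_handler = usr1_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    rc = sigaction(SIGUSR1, &sa, &old);
    assert(rc == 0);
    (void) rc;

    for (i = 0; i < rounds; i++)
        raise(SIGUSR1);

    sigaction(SIGUSR1, &old, NULL);
    return (int) nested_entries;
}
```

Calling count_nested_deliveries(1000) from a main() should return 0 wherever
signal blocking works as POSIX describes; a nonzero result would correspond
to the "signals already forbidden" warning in the log above.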

Attachment: cygwin-signal-debug-v1.patch (text/plain, 9.6 KB)
