Re: Getting server crash after running sqlsmith

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Getting server crash after running sqlsmith
Date: 2017-03-29 01:49:42
Message-ID: CA+Tgmobz6AHtQVqHQa-CCm4_yWygZ8HC0KUMBUw63583KiyypA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 28, 2017 at 9:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Tue, Mar 28, 2017 at 2:36 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Hm ... I don't see a crash here, but I wonder whether you have parameters
>>> set that would cause this query to be run as a parallel query? Because
>>> pg_rotate_logfile() is marked as parallel-safe in pg_proc, which seems
>>> probably insane.
>
>> /me blinks
>
>> Uh, what's insane about that? All it does is test a GUC (which is
>> surely parallel-safe) and call SendPostmasterSignal (which seems safe,
>> too).
>
> Well, if you don't like that theory, what's yours?

Gremlins?

The stack trace seems to show that the process is receiving SIGUSR1 at
a very high rate. Every time sigusr1_handler() reaches
PG_SETMASK(&UnBlockSig), it immediately gets a SIGUSR1 and jumps back
into sigusr1_handler(). Now, this seems like a design flaw in
sigusr1_handler(). Likely the operating system blocks SIGUSR1 on
entry to the signal handler so that it's not possible for a high rate
of signal delivery to blow out the stack, but we forcibly unblock it
before returning, thus exposing ourselves to blowing out the stack.
And we have, apparently, no stack depth check here nor any other way
of preventing the infinite recursion.
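
The re-entry pattern described above can be sketched in a small standalone
C program. This is not PostgreSQL code: demo_handler(), max_nesting() and
the counters are hypothetical names invented for the illustration, and the
sketch assumes a single-threaded POSIX process. The handler raises one
more SIGUSR1 while the signal is still blocked, standing in for a sender
that keeps signalling, and then optionally unblocks SIGUSR1 before
returning, roughly the way PG_SETMASK(&UnBlockSig) would.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical demo state; none of these names come from PostgreSQL. */
static volatile sig_atomic_t depth;       /* current handler nesting */
static volatile sig_atomic_t max_depth;   /* deepest nesting observed */
static volatile sig_atomic_t remaining;   /* signals still to be raised */
static volatile sig_atomic_t do_unblock;  /* unblock inside the handler? */

static void
demo_handler(int signo)
{
    sigset_t set;

    (void) signo;
    depth++;
    if (depth > max_depth)
        max_depth = depth;

    /*
     * Simulate a sender that keeps signalling. SIGUSR1 is blocked while
     * the handler runs, so this one only becomes pending.
     */
    if (remaining > 0)
    {
        remaining--;
        raise(SIGUSR1);
    }

    if (do_unblock)
    {
        /*
         * Unblocking here delivers the pending signal immediately,
         * re-entering demo_handler() on top of the current stack frame.
         */
        sigemptyset(&set);
        sigaddset(&set, SIGUSR1);
        sigprocmask(SIG_UNBLOCK, &set, NULL);
    }

    depth--;
}

/* Deliver `signals` SIGUSR1s back to back; return the deepest nesting. */
static int
max_nesting(int signals, int unblock_in_handler)
{
    struct sigaction sa;
    sigset_t set;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;    /* no SA_NODEFER: SIGUSR1 is blocked in the handler */
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);
    (void) rc;

    /* Make sure SIGUSR1 is deliverable in the caller. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_UNBLOCK, &set, NULL);

    depth = 0;
    max_depth = 0;
    remaining = signals - 1;
    do_unblock = unblock_in_handler;

    raise(SIGUSR1);
    return max_depth;
}
```

Under these assumptions, max_nesting(5, 1) should report a nesting depth
of 5, one stack frame per signal, while max_nesting(5, 0) should stay at
1 because each pending signal is only delivered after the previous
handler has returned. With an unbounded signal rate the first case has
no upper limit, which is the stack blowout described above.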

I imagine the behavior here is platform-dependent, but I'd guess that
select pg_rotate_logfile() from generate_series(1,1000000) g might
reproduce this on affected platforms with or without parallel query in
the mix. It looks like we've conveniently provided both a function
that can be used to SIGUSR1 the heck out of the postmaster and a
postmaster that is, at least on such platforms, vulnerable to crashing
if you do that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
