Re: stress test for parallel workers

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: mark(at)2ndquadrant(dot)com, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-10-11 18:56:41
Message-ID: 14878.1570820201@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> What we've apparently got here is that signals were received
> so fast that the postmaster ran out of stack space. I remember
> Andres complaining about this as a theoretical threat, but I
> hadn't seen it in the wild before.

> I haven't finished investigating though, as there are some things
> that remain to be explained.

I still don't have a good explanation for why this only seems to
happen in the pg_upgrade test sequence. However, I did notice
something very interesting: the postmaster crashes after consuming
only about 1MB of stack space. This is despite the prevailing
setting of "ulimit -s" being 8192 (8MB). I also confirmed that
the value of max_stack_depth within the crashed process is 2048,
which implies that get_stack_depth_rlimit got some value larger
than 2MB from getrlimit(RLIMIT_STACK). And yet, here we have
a crash, and the process memory map confirms that only 1MB was
allocated in the stack region. So it's really hard to explain
that as anything except a kernel bug: sometimes, the kernel
doesn't give us as much stack as it promised it would. And the
machine is not loaded enough for there to be any rational
resource-exhaustion excuse for that.
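
For reference, the check in question boils down to asking the kernel
for RLIMIT_STACK. A minimal standalone sketch (illustration only, not
the actual get_stack_depth_rlimit code) looks about like this:

#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
	struct rlimit rl;

	/* Ask the kernel how much stack this process is promised */
	if (getrlimit(RLIMIT_STACK, &rl) != 0)
	{
		perror("getrlimit");
		return 1;
	}

	if (rl.rlim_cur == RLIM_INFINITY)
		printf("stack rlimit: unlimited\n");
	else
		printf("stack rlimit: %lu bytes\n", (unsigned long) rl.rlim_cur);
	return 0;
}

Running something like that on the affected host is a quick way to
double-check what the kernel claims to promise, independently of the
server.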

This matches up with the intermittent infinite_recurse failures
we've been seeing in the buildfarm. Those are happening across
a range of systems, but they're (almost) all Linux-based ppc64,
suggesting that there's a longstanding arch-specific kernel bug
involved. For reference, I scraped the attached list of such
failures in the last three months. I wonder whether we can get
the attention of any kernel hackers about that.

Anyway, as to what to do about it --- it occurred to me to wonder
why we are relying on having the signal handlers block and unblock
signals manually, when we could tell sigaction() that we'd like
signals blocked. It is reasonable to expect that the platform's signal
support is designed not to consume stack space recursively when signals
arrive in quick succession, while the way we are doing it clearly opens
us up to exactly that kind of recursive consumption. The stack trace I showed
before proves that the recursion happens at the points where the
signal handlers unblock signals.
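
For concreteness, the idea is nothing more exotic than filling the
handler's sa_mask, so that the kernel keeps other signals blocked for
the duration of the handler instead of the handler manipulating the
mask itself. A minimal standalone sketch (not the attached patch; the
names here are made up for illustration):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void
demo_handler(int signo)
{
	/*
	 * Because sa_mask was filled at sigaction() time, other signals stay
	 * blocked until this returns; nothing here unblocks them by hand.
	 */
	(void) signo;
}

/* install "func" for "signo" with signals blocked during delivery */
static void
set_blocking_handler(int signo, void (*func) (int))
{
	struct sigaction act;

	act.sa_handler = func;
	sigfillset(&act.sa_mask);	/* presumably BlockSig in the real patch */
	act.sa_flags = 0;			/* no SA_RESTART, a la pqsignal_no_restart */

	if (sigaction(signo, &act, NULL) < 0)
		perror("sigaction");
}

int
main(void)
{
	set_blocking_handler(SIGUSR1, demo_handler);
	pause();
	return 0;
}

The important difference from the status quo is that the unblocking
then happens in the kernel's signal-return path rather than via an
explicit call inside the handler, so there is no window in which a
fresh signal can pile another handler frame onto the stack.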

As a quick hack I made the attached patch, and it seems to fix the
problem on wobbegong's host. I don't see crashes any more, and
watching the postmaster's stack space consumption, I see it stay
comfortably at a tad under 200KB (probably the default initial
allocation), while without the patch it tends to blow up to 700KB
or more even in runs that don't crash.
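
In case anyone wants to repeat the measurement: one easy way to watch
that, assuming Linux /proc, is to track the size of the "[stack]"
mapping in /proc/<pid>/maps. A throwaway sketch (not part of the
patch; everything here is invented for illustration):

#include <stdio.h>
#include <string.h>

int
main(int argc, char **argv)
{
	char		path[64];
	char		line[512];
	FILE	   *f;

	if (argc != 2)
	{
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/maps", argv[1]);
	if ((f = fopen(path, "r")) == NULL)
	{
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f) != NULL)
	{
		unsigned long start,
					end;

		/* lines look like "7ffc12340000-7ffc12361000 rw-p ... [stack]" */
		if (strstr(line, "[stack]") != NULL &&
			sscanf(line, "%lx-%lx", &start, &end) == 2)
			printf("stack mapping: %lu kB\n", (end - start) / 1024);
	}
	fclose(f);
	return 0;
}

Point it at the postmaster's PID in a loop while the tests run and you
can see how large the mapping grows.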

This patch isn't committable as-is because it will (I suppose)
break things on Windows; we still need the old way there for lack
of sigaction(). But that could be fixed with a few #ifdefs.
I'm also kind of tempted to move pqsignal_no_restart into
backend/libpq/pqsignal.c (where BlockSig is defined) and maybe
rename it, but I'm not sure to what.

This issue might go away if we switched to a postmaster implementation
that doesn't do work in the signal handlers, but I'm not entirely
convinced of that. The existing handlers don't seem to consume a lot
of stack space in themselves (there aren't many local variables in them).
The bulk of the stack consumption is seemingly in the platform's signal
infrastructure, so that we might still have a stack consumption issue
even with fairly trivial handlers, if we don't tell sigaction to block
signals. In any case, this fix seems potentially back-patchable,
while we surely wouldn't risk back-patching a postmaster rewrite.

Comments?

regards, tom lane

Attachment Content-Type Size
recent-infinite_recurse-failures.txt text/plain 9.1 KB
let-sigaction-do-the-blocking-wip.patch text/x-diff 2.2 KB
