Re: problems on Solaris

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dave Page <dpage(at)pgadmin(dot)org>
Subject: Re: problems on Solaris
Date: 2015-05-27 19:39:14
Message-ID: CA+TgmoajRM0RJePuDxw2FK1Gts4gMAgVmbQ+9tHszYr4UsomEw@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hm. So we have a *occasional* stack size exceeded failure and an
> occasional spinlock error in test_shm_mq. I'm inclined to think that
> this is a shm_mq problem, and not a more general locking problem - it
> seems likely, but not guaranteed, that that'd have materialized
> elsewhere.

I think the problem might be that the spinlock-based memory barrier is
not re-entrant. Suppose some kind of barrier operation is in progress,
and we've acquired the dummy spinlock but not yet released it. Just
then, we receive a signal. Since the shm_mq code sets
set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
SetLatch now includes barrier operations, so we'll try to acquire and
release the spinlock despite already holding it. Oops.

> Robert: IIRC there was some problems with shm_mq tests being stuck
> before, right?

The last round of investigation, on anole, resulted in this fix:

commit d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4
Author: Robert Haas <rhaas(at)postgresql(dot)org>
Date: Sat Oct 4 21:25:41 2014 -0400

    Eliminate one background-worker-related flag variable.

    Teach sigusr1_handler() to use the same test for whether a worker
    might need to be started as ServerLoop(). Aside from being perhaps
    a bit simpler, this prevents a potentially-unbounded delay when
    starting a background worker. On some platforms, select() doesn't
    return when interrupted by a signal, but is instead restarted,
    including a reset of the timeout to the originally-requested value.
    If signals arrive often enough, but no connection requests arrive,
    sigusr1_handler() will be executed repeatedly, but the body of
    ServerLoop() won't be reached. This change ensures that, even in
    that case, background workers will eventually get launched.

    This is far from a perfect fix; really, we need select() to return
    control to ServerLoop() after an interrupt, either via the self-pipe
    trick or some other mechanism. But that's going to require more
    work and discussion, so let's do this for now to at least mitigate
    the damage.

    Per investigation of test_shm_mq failures on buildfarm member anole.

The problem here isn't really with test_shm_mq; it's with the
postmaster. To really make this work properly, we need to be able to
use latches in the postmaster, and we need to generalize
WaitLatchOrSocket so that it can wait for a latch or any of n sockets.
Then ServerLoop can use that instead of calling select() directly. This
will probably look a lot like what you did to get rid of
ImmediateInterruptOK.

But all of that seems unrelated to the current problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
