Re: Postgres 7.4.7 hang in async_notify

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-bugs(at)counterstorm(dot)com
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Postgres 7.4.7 hang in async_notify
Date: 2005-06-02 18:21:25
Message-ID: 26109.1117736485@sss.pgh.pa.us
Lists: pgsql-bugs

pgsql-bugs(at)counterstorm(dot)com writes:
> We saw the problem with async_notify again (See thread with subject
> "Postgres 7.4.6 hang in async_notify" in first message to this list
> dated "Mon, 25 Apr 2005 15:42:35 -0400") in a production setting.
> Since our last instance, we converted to compiling postgres with
> debugging, so we have a stack trace. Looking at it, the problem
> appears at first blush like it might be pretty obvious: an ill-timed
> signal which arrives during a malloc while malloc has some
> data-structure locked, and one of the extensive operations that
> Async_NotifyHandler did probably involved getting the same lock.

So it would seem. The Async_NotifyHandler mechanism was designed at a
time when ReadCommand didn't call anything of interest except read(),
and so the assumption is that it's OK for PostgresMain to do this
(oversimplified a bit):

    EnableNotifyInterrupt();

    firstchar = ReadCommand(&input_message);

    DisableNotifyInterrupt();

Clearly, if SSL is going to be messing about with malloc() then this
assumption is no longer safe. Looking at the code, I think we have
introduced some other risks of the same ilk ourselves, but SSL is
doubtless the largest variable. This probably explains a number of
other irreproducible failures besides your hangup :-(
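
To make the hazard concrete: malloc() is not async-signal-safe, so a handler that allocates (directly or through a library) can deadlock if it interrupts a malloc() already in progress. The usual defensive pattern is for the handler to touch nothing but a flag and leave the real work to normal code. The following is only a minimal sketch of that pattern, not PostgreSQL source; flag_only_handler, service_pending and got_signal are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flag; sig_atomic_t is the type C guarantees is safe to
 * write from a signal handler. */
static volatile sig_atomic_t got_signal = 0;

/* Handler touches only the flag: no malloc(), no locks, no stdio. */
static void flag_only_handler(int signo)
{
    (void) signo;
    got_signal = 1;
}

/*
 * Called from normal (non-handler) context, so it may allocate freely.
 * Returns a heap copy of msg if a signal was pending, else NULL.
 * The caller frees the result.
 */
static char *service_pending(const char *msg)
{
    char *copy;

    if (!got_signal)
        return NULL;
    got_signal = 0;

    copy = malloc(strlen(msg) + 1);
    if (copy != NULL)
        strcpy(copy, msg);
    return copy;
}
```

A caller would install the handler with signal() or sigaction() and call service_pending() at a point where no library call can be in progress.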

I think we're going to have to push the enable/disable interrupt
operations down closer to the actual read(). This doesn't seem to
be any big deal for the non-SSL case, but it's not clear to me what
we have to do to get between SSL and the socket. Anyone know offhand?
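
For the non-SSL case, the idea of pushing the enable/disable calls down to the read() might look roughly like the sketch below. It is an illustration under stated assumptions, not a patch: guarded_read, enable_notify, disable_notify, process_notify and the three flags are hypothetical stand-ins, and process_notify only bumps a counter where real code would do the actual work. It does not answer the open question of how to get between SSL and the socket.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical state, not PostgreSQL's actual variables. */
static volatile sig_atomic_t notify_enabled = 0;  /* handler may act at once */
static volatile sig_atomic_t notify_pending = 0;  /* arrived while disabled  */
static volatile sig_atomic_t notify_handled = 0;  /* interrupts processed    */

/* Stand-in for the real interrupt work. */
static void process_notify(void)
{
    notify_pending = 0;
    notify_handled = notify_handled + 1;
}

static void notify_handler(int signo)
{
    (void) signo;
    if (notify_enabled)
    {
        notify_enabled = 0;     /* no re-entry while processing */
        process_notify();
        notify_enabled = 1;
    }
    else
        notify_pending = 1;     /* defer until the next enable */
}

/* Open the window, first servicing anything that arrived while closed. */
static void enable_notify(void)
{
    for (;;)
    {
        notify_enabled = 1;
        if (!notify_pending)
            break;
        notify_enabled = 0;
        if (notify_pending)
            process_notify();
    }
}

static void disable_notify(void)
{
    notify_enabled = 0;
}

/* The window now covers only the read() call itself, not any buffer
 * management or allocation done by the caller. */
static ssize_t guarded_read(int fd, void *buf, size_t len)
{
    ssize_t n;

    enable_notify();
    n = read(fd, buf, len);
    disable_notify();
    return n;
}
```

With this shape, anything the command-reading code does outside guarded_read() runs with the interrupt disabled, so a signal arriving there is merely recorded and serviced at the next enable.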

> For the record, while this postgres should be (of two) generating
> notifies out of triggers, we do not believe it should be listening for
> any, and indeed examination of pg_listener suggests it does not.

Doesn't matter --- 7.4 uses the same mechanism for SI messaging catchup
interrupts. A backend that sits idle long enough *will* get one of
these interrupts. Apparently you've managed to set up a situation where
the client starts doing something after just-the-right-delay with better
than nil probability.

regards, tom lane
