Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From: "Peter Koczan" <pjkoczan(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3504: Some listening sessions never return from writing, problems ensue
Date: 2007-08-10 15:59:15
Message-ID: 4544e0330708100859ibd71a7brddf224669bd58eab@mail.gmail.com
Lists: pgsql-bugs

On 8/9/07, Peter Koczan <pjkoczan(at)gmail(dot)com> wrote:
> On 8/6/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > "Peter Koczan" <pjkoczan(at)gmail(dot)com> writes:
> > > Here's my theory (and feel free to tell me that I'm full of it)...somehow, a
> > > lot of notifies happened at once, or in a very short period of time, to the
> > > point where the app was still processing notifies when the timer clicked off
> > > another second. The connection (or app, or perl module) never marked those
> > > notifies as being processed, or never updated its timestamp of when it
> > > finished, so when the next notify came around, it tried to reprocess the old
> > > data (or data since the last time it finished), and yet again couldn't
> > > finish. Lather, rinse, repeat. In sum, it might be that trying to call
> > > pg_notifies while processing notifies tickles a race condition and tricks
> > > the connection into thinking it's in a bad state.
> >
> > Hmm. Is the app trying to do this processing inside an interrupt
> > service routine (a/k/a signal handler)? If so, and if the ISR can
> > interrupt itself, then you've got a problem because you'll be doing
> > reentrant calls of libpq, which it doesn't support. You can only make
> > that work if the handler blocks further occurrences of its signal until
> > it finishes.
> >
>
> I'm not entirely sure if this answers your question, but here's what I
> found out from the primary maintainer of the app. Note that
> update_reqs is the function calling pg_notifies. If there's more
> information I can provide or another test we can run, please let me
> know.
>
> ------- BEGIN MESSAGE -------
> I just checked and the timer won't interrupt update_reqs, so we'll
> have to look for another solution. Anyway, update_reqs doesn't do
> anything with the database except for checking for a notify, so I
> don't see where it can be interrupted to cause DB problems.
> ------- END MESSAGE -------
>
> I also found out that one notify gets sent per action (not per batch
> of actions), so if n requests get resolved at once, n notifies are
> sent, not 1. In theory this could mitigate this problem, but I don't
> know how easy it is at this point. Still, it doesn't explain how or
> why the client's recv-q isn't getting cleared.
>
> Hope this helps.
>

On our end, we changed the code in the function calling
pg_notifies to use an if statement rather than a while (that way it
only updates once per second instead of continuously as long as there
are pending async notifies).
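
A minimal sketch of the difference between the two loop shapes, written in
Python rather than the app's Perl. FakeConnection, poll_notifies,
tick_with_while, tick_with_if and make_noisy_handler are all hypothetical
names used for illustration; none of this is the DBD::Pg API or the app's
actual code. It only assumes that one poll returns every pending notify and
that new notifies can arrive while a batch is being processed.

```python
from collections import deque


class FakeConnection:
    """Hypothetical stand-in for a listening connection (not DBD::Pg)."""

    def __init__(self, pending):
        self.pending = deque(pending)

    def poll_notifies(self):
        """Return every pending notify at once and clear the queue."""
        batch = list(self.pending)
        self.pending.clear()
        return batch


def tick_with_while(conn, on_batch, max_rounds=1000):
    """Old shape: keep refreshing for as long as notifies keep arriving."""
    rounds = 0
    while rounds < max_rounds:  # max_rounds only keeps the sketch finite
        batch = conn.poll_notifies()
        if not batch:
            break
        on_batch(conn, batch)
        rounds += 1
    return rounds


def tick_with_if(conn, on_batch):
    """New shape: at most one refresh per timer tick."""
    batch = conn.poll_notifies()
    if batch:
        on_batch(conn, batch)
        return 1
    return 0


def make_noisy_handler(extra):
    """Build a handler during which `extra` further notifies arrive."""
    remaining = [extra]

    def on_batch(conn, batch):
        if remaining[0] > 0:
            remaining[0] -= 1
            conn.pending.append("reqs")  # a notify lands mid-processing

    return on_batch
```

With three notifies arriving during processing, tick_with_while refreshes
four times inside a single tick, while tick_with_if refreshes once and
leaves the newly arrived notify queued for the next tick.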

I looked more closely at the docs for DBD::Pg, and the pg_notifies
call grabs *all* pending async notifies and returns them in a hash,
not just one at a time. So, what was happening before was that if a
new notify came through while processing the previous notifies, the
code would reprocess. Lather, rinse, repeat. I think the program is
still waiting for pg_notifies when the timer interrupts it again,
causing the client to call pg_notifies while still waiting for
something. I think this is what gets the listening connection into the
bad state.
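
The reentrancy concern quoted above (a handler re-entering the client
library before the previous call has returned) can be illustrated with a
simple busy flag. This is a sketch in Python, not the app's Perl, and it is
a swapped-in approximation: it skips an overlapping call rather than
blocking the signal itself, which is what the quoted advice actually
describes. ReentryGuard is a hypothetical name, and the flag is only safe
for a single-threaded program like the one sketched here.

```python
class ReentryGuard:
    """Hypothetical guard that refuses to start a call while one is running."""

    def __init__(self):
        self.busy = False
        self.skipped = 0

    def run(self, fn):
        """Run fn unless a previous run is still in progress.

        Returns True if fn ran, False if the call was skipped.
        """
        if self.busy:
            self.skipped += 1
            return False
        self.busy = True
        try:
            fn()
        finally:
            # Always clear the flag, even if fn raised.
            self.busy = False
        return True
```

If the timer callback wraps its notify check in guard.run(...), a tick that
fires while the previous check is still in progress is counted in
guard.skipped and returns immediately instead of issuing a second call on
the same connection.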

In theory this change should mitigate the "notify interrupt" behavior
on our end, but, again, why the client's recv-q is filling up is as
yet unexplained.

Peter

P.S. In src/backend/commands/async.c, somewhere between lines 910 and
981 (set_ps_display calls) is where the code gets interrupted. How and
why, I don't know.
