Re: BUG #3504: Some listening sessions never return from writing, problems ensue

From: "Peter Koczan" <pjkoczan(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3504: Some listening sessions never return from writing, problems ensue
Date: 2007-08-07 01:05:37
Message-ID: 4544e0330708061805wcafa72br623aad2e50fe250b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Not quite, here's what happens every second when idling, for both good and
bad connections (with a different values to each syscall). It is trying to
read data, I am a little concerned about the EAGAIN return value, though.

[koczan(at)ator] ~ $ strace -p 6987
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}, {fd=8,
events=POLLIN}], 3, 869) = 0
gettimeofday({1186445557, 631779}, NULL) = 0
read(5, 0x9651540, 5) = -1 EAGAIN (Resource temporarily
unavailable)
gettimeofday({1186445557, 632197}, NULL) = 0
ioctl(3, FIONREAD, [0]) = 0
poll(<unfinished ...>
Process 6987 detached

However, looking at straces when an async notify actually happens, I notice
something interesting. The number of syscalls over the same interrupt on the
bad connection is nearly 5 times that of the good connection. By the time
the good connection is finished, the bad connection is still mucking with
data. I can send you the actual straces if you think it would help.

Here's my theory (and feel free to tell me that I'm full of it)...somehow, a
lot of notifies happened at once, or in a very short period of time, to the
point where the app was still processing notifies when the timer clicked off
another second. The connection (or app, or perl module) never marked those
notifies as being processed, or never updated its timestamp of when it
finished, so when the next notify came around, it tried to reprocess the old
data (or data since the last time it finished), and yet again couldn't
finish. Lather, rinse, repeat. In sum, it might be that trying to call
pg_notifies while processing notifies tickles a race condition and tricks
the connection into thinking its in a bad state.

The holes in my theory are that the app doesn't try to read everything every
second for bad connections (so notifies are being processed somehow), and it
doesn't explain why the recv-q is filling up even for good connections.
Hopefully this will help somehow, though.

Peter

P.S. Just in case its useful, the contents of pg_listener for bad
connections (I verified that notification = 0 for good connections).
csdb=# select * from pg_listener where notification > 0;
relname | listenerpid | notification
---------+-------------+--------------
req | 3292 | 19141
req | 25848 | 16308
req | 17004 | 23815
req | 19583 | 1703
(4 rows)

On 8/6/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> "Peter Koczan" <pjkoczan(at)gmail(dot)com> writes:
> > So, it looks like the app polls for async notifies every second.
> Hopefully
> > you can glean some insight from that.
>
> My guess is that some condition causes that to stop happening.
>
> One way to monitor this from outside is to use strace to watch the app.
> According to the above you should see a recv() attempt on the libpq
> socket every second. I suspect that you won't see any more once the app
> gets into the state where the recv-q is backing up. Looking at what
> happens around the time of the last recv() might give some clues.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Gregory Stark 2007-08-07 01:13:18 Re: RE : RE : BUG #3519: Postgres takes the wrong query plan resulting in performance issues
Previous Message Tom Lane 2007-08-06 22:36:16 Re: BUG #3504: Some listening sessions never return from writing, problems ensue