Postgres 7.4.6 hang in async_notify

From: pgsql-bugs(at)counterstorm(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: Postgres 7.4.6 hang in async_notify
Date: 2005-04-25 19:42:35
Message-ID: 200504251942.j3PJga718714@winwood.sysdetect.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


We have seen this twice. The first time, it was on a 32 bit Fedora
Core 3 system during a `postgres stop`. The stop hung because one of
the processes was in async_notify and refused to get out. We were
unable to do further debugging because the db was in shutdown. We
were forced to kill -9 the process.

The second time we saw this, today, was on a 64 bit Opteron Fedora
Core 3 system during operation. One of our postgres processes was
hung in async_notify causing all further processing by that process to
be stopped. All other processes were fine, including ones which used
notify. After a variety of debugging and shutdown attempts (described
in further detail below), we were forced to kill -9 the process

Unfortunately, we do not know how to reproduce this (though we are
willing to try if anyone has any ideas). We are obviously looking for
a patch, a recovery method more subtle than kill -9, or at least
something to do the next time we see this problem to help further
debugging. A description of what happened, a dump of pg_locks, some
`ps` examples, and some final postgres shutdown messages are included
below. Any attention you could spare would be most appreciated.

The way the notify gets called is a bit convoluted. Our C program
performs an insert in table T. Some triggers run doing all sorts of
stuff in the background involving changes to other tables. Under some
circumstances, a row is inserted by the trigger into table W (a
trigger off of that is executed) and the parent trigger performs a
notify. This, while expensive, normally all works. However, at a
customer site we saw the postgres process hang out in async_notify and
refuse to allow more inserts.

As far as we know, there was a listener (a perl process) for that
notify (we failed to look at pg_listeners) but we were able to use
psql to listen on that value and notify on it as well. We doubt the
notify rate was high enough to cause hundreds or even dozens of
notifies to be outstanding, but it is certainly possible that a
handful were.

After struggling with this for a bit (which included sending the hung
process a SIGUSR2 because of some emails on this list which seemed to
suggest a relationship between the signal and notifies), we attempted
to stop the system. We killed off all postgres initiators (including
the process doing the insert which caused it to hang in async_notify)
but the program remained hung. We initiated a database shutdown
(whapped the postmaster with a SIGTERM) but the postgres remained in
async_notify. We tried directly sending the hung process a SIGTERM
but that did not help. Finally, we were forced to send the process
the uncatchable signal SIGKILL, which obviously worked.

psql=# select * from pg_locks;
relation | database | transaction | pid | mode | granted
----------+----------+-------------+-------+-----------------+---------
| | 565845 | 11603 | ExclusiveLock | t
16759 | 17212 | | 469 | AccessShareLock | t
| | 2153119 | 469 | ExclusiveLock | t

different psql (pid 1013)

psql=# select * from pg_locks;
relation | database | transaction | pid | mode | granted
----------+----------+-------------+-------+-----------------+---------
| | 565845 | 11603 | ExclusiveLock | t
| | 2154368 | 1013 | ExclusiveLock | t
16759 | 17212 | | 1013 | AccessShareLock | t

----------------------------------------------------------------------
26535 ? S 0:08 postmaster -i -D postgres/data
26536 ? S 0:47 postgres: stats buffer process
26537 ? S 0:27 postgres: stats collector process
26551 ? Ss 0:00 initlog -c pg_autovacuum -U user1
26599 ? S 0:19 pg_autovacuum -U user1
29684 ? S 16:31 postgres: sysd sysd 10.139.140.58 idle
30881 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle
10549 ? S 0:11 postgres: sysd sysd 10.139.140.58 idle
10626 ? S 0:03 postgres: sysd sysd 10.139.140.58 idle
10668 ? S 1:53 postgres: sysd sysd 10.139.140.58 idle
11178 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle
11258 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle
11603 ? S 1:14 postgres: sysd sysd 10.139.140.58 async_notify
11718 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle
32506 pts/3 T 0:00 psql
32553 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle
32649 pts/3 T 0:00 psql -U user1
32696 ? S 0:00 postgres: user1 sysd 10.139.140.58 idle
374 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle
389 pts/3 S+ 0:00 grep postgres
----------------------------------------------------------------------

After killing off *all* postgres initiators, including the parent of
11603.

----------------------------------------------------------------------
26535 ? S 0:08 postmaster -i -D postgres/data
26536 ? S 0:47 postgres: stats buffer process
26537 ? S 0:28 postgres: stats collector process
11603 ? S 1:14 postgres: sysd sysd 10.139.140.58 async_notify
----------------------------------------------------------------------

Apr 25 11:01:02 localhost postgres[26535]: [5-1] LOG: server process (PID 11603) was terminated by signal 9
Apr 25 11:01:02 localhost postgres[26535]: [6-1] LOG: terminating any other active server processes
Apr 25 11:01:02 localhost postgres[26535]: [7-1] LOG: all server processes terminated; reinitializing
Apr 25 11:01:02 localhost postgres[1947]: [8-1] LOG: database system was interrupted at 2005-04-25 10:56:05 EDT
Apr 25 11:01:02 localhost postgres[1947]: [9-1] LOG: checkpoint record is at 0/153152B8
Apr 25 11:01:02 localhost postgres[1947]: [10-1] LOG: redo record is at 0/153152B8; undo record is at 0/0; shutdown FALSE
Apr 25 11:01:02 localhost postgres[1947]: [11-1] LOG: next transaction ID: 2156499; next OID: 66294
Apr 25 11:01:02 localhost postgres[1947]: [12-1] LOG: database system was not properly shut down; automatic recovery in progress
Apr 25 11:01:02 localhost postgres[1947]: [13-1] LOG: record with zero length at 0/15315300
Apr 25 11:01:02 localhost postgres[1947]: [14-1] LOG: redo is not required
Apr 25 11:01:04 localhost postgres[1947]: [15-1] LOG: database system is ready
Apr 25 11:01:04 localhost postgres[1949]: [8-1] LOG: shutting down
Apr 25 11:01:06 localhost postgres[1949]: [9-1] LOG: database system is shut do

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Juan Villa 2005-04-25 20:08:16 BUG #1625: INTALL PROBLEM
Previous Message Kris Jurka 2005-04-25 17:09:59 Re: BUG #1624: connect uses properties before URL