Active zombies at AIX

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Active zombies at AIX
Date: 2017-01-24 15:08:05
Message-ID: 06f4d085-e2a5-83a7-919a-cb5a878f9e42@postgrespro.ru
Lists: pgsql-hackers

Hi hackers,

Yet another story about AIX. For some reason, AIX is very slow at cleaning
up zombie processes.
If we launch pgbench with the -C option, the maximum number of connections
is exhausted very quickly.
If the maximum number of connections is set to 1000, then after ten seconds
of pgbench activity we get about 900 zombie processes, and it takes about
100 seconds (!) before all of them are terminated.

proctree shows a lot of defunct processes:

[14:44:41]root(at)postgres:~ # proctree 26084446
26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
4784362 <defunct>
4980786 <defunct>
11403448 <defunct>
11468930 <defunct>
11993176 <defunct>
12189710 <defunct>
12517390 <defunct>
13238374 <defunct>
13565974 <defunct>
13893826 postgres: wal writer process
14024716 <defunct>
15401000 <defunct>
...
25691556 <defunct>

But ps shows that the status of the process is <exiting>:

[14:46:02]root(at)postgres:~ # ps -elk | grep 25691556

* A - 25691556 - - - - - <exiting>

A breakpoint set in the postmaster's reaper() function (called on SIGCHLD)
shows that each invocation processes 5-10 PIDs.
So there are two hypotheses: either AIX is very slow to deliver SIGCHLD
to the parent, or process exit takes too much time.
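For context on why one reaper() call handles several PIDs: SIGCHLD is a standard (non-queued) POSIX signal, so several child exits that occur while it is pending are merged into a single delivery, and a handler is expected to loop over waitpid() with WNOHANG until no exited child remains. The sketch below illustrates that generic pattern only; it is not PostgreSQL's actual reaper(), and the names reap_exited_children(), sigchld_handler(), install_sigchld_handler(), wait_for_reaped() and reaped_total are hypothetical.

```c
#define _XOPEN_SOURCE 700  /* expose sigaction(), SA_RESTART, nanosleep() */

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>  /* fork() and _exit() for callers */

/* Hypothetical counter: total children reaped; only the handler writes it. */
static volatile sig_atomic_t reaped_total = 0;

/* Reap every child that has already exited, without blocking.
   Returns how many were collected in this call. */
static int reap_exited_children(void)
{
    int status;
    int n = 0;

    /* waitpid() returns 0 when children exist but none has exited yet,
       and -1 (ECHILD) when there are no children left at all. */
    while (waitpid(-1, &status, WNOHANG) > 0)
        n++;
    return n;
}

/* Several child exits may be merged into one SIGCHLD delivery,
   so a single handler call can reap many PIDs. */
static void sigchld_handler(int signo)
{
    int saved_errno = errno;  /* do not clobber errno of interrupted code */

    (void) signo;
    reaped_total += reap_exited_children();
    errno = saved_errno;
}

static int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    /* SIGCHLD stays blocked while the handler runs (no SA_NODEFER). */
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Poll until at least `expected` children have been reaped or roughly
   `timeout_ms` milliseconds have passed. Returns 1 on success. */
static int wait_for_reaped(int expected, int timeout_ms)
{
    struct timespec tick = {0, 1000000L};  /* 1 ms */

    for (int waited = 0; waited < timeout_ms; waited++) {
        if (reaped_total >= expected)
            return 1;
        nanosleep(&tick, NULL);  /* may return early on EINTR; harmless */
    }
    return reaped_total >= expected;
}
```

Under this pattern, several PIDs per call is expected whenever several children exit close together. A quick check that does not involve PostgreSQL at all is to fork a batch of children that call _exit(0) immediately and time how long wait_for_reaped() takes to see all of them.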

The fact that the backends are in the exiting state makes the second
hypothesis more likely.
We have tried different Postgres configurations with local and TCP
sockets and with different amounts of shared buffers, and built with both
gcc and xlc.
In all cases the behavior is similar: the zombies do not want to die.
Since it is not possible to attach a debugger to a defunct process, it
is not clear how to find out what's going on.

I wonder if somebody has encountered similar problems on AIX and can
perhaps suggest a solution.
Thanks in advance

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
