|From:||Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>|
|Subject:||Re: Active zombies at AIX|
I tried to rebuild Postgres without mmap and the problem disappears
(pgbench with -C doesn't cause connection limit exhaustion any more).
Unfortunately, there is no convenient way to configure PostgreSQL not
to use mmap.
I have to edit sysv_shmem.c file, commenting
I wonder why we now prohibit configuring Postgres without mmap?
On 06.02.2017 12:47, Konstantin Knizhnik wrote:
> Latest update on the problem.
> Using the kdb tool (thanks to Tony Reix for advice and help) we got the
> following stack trace of a Postgres backend in the exiting state:
> pvthread+073000 STACK:
> [005E1958]slock+000578 (00000000005E1958, 8000000000001032 [??])
> .simple_lock+000058 ()
> [00651DBC]vm_relalias+00019C (??, ??, ??, ??, ??)
> [006544AC]vm_map_entry_delete+00074C (??, ??, ??)
> [00659C30]vm_map_delete+000150 (??, ??, ??, ??)
> [00659D88]vm_map_deallocate+000048 (??, ??)
> [0011C588]kexitx+001408 (??)
> [000BB08C]kexit+00008C ()
> ___ Recovery (FFFFFFFFFFF9290) ___
> WARNING: Eyecatcher/version mismatch in RWA
> So there seems to be lock contention while unmapping memory segments.
> My assumption was that Postgres detaches all attached segments
> before exit (in the shmem_exit callback or earlier).
> I added logging around the proc_exit_prepare function (which is
> called by the atexit callback) and checked that it completes immediately.
> So I thought that this vm_map_deallocate might be related to the
> deallocation of normal (malloced) memory, because on Linux the memory
> allocator may use mmap.
> But on AIX this is not true.
> Below is the report from Bergamini Demien (once again, many thanks for
> the help with investigating the problem):
> The memory allocator in AIX libc does not use mmap and vm_relalias()
> is only called for shared memory mappings.
> I talked with the AIX VMM expert at IBM and he said that what you hit
> is one of the most common performance bottlenecks in AIX memory
> He also said that SysV Shared Memory (shmget/shmat) performs better on
> AIX than mmap.
> Some improvements have been made in AIX 6.1 (see “perf suffers when
> procs sharing the same segs all exit at once”:
> http://www-01.ibm.com/support/docview.wss?uid=isg1IZ83819) but it does
> not help in your case.
> In src/backend/port/sysv_shmem.c, it says that PostgreSQL 9.3 switched
> from using SysV Shared Memory to using mmap.
> Maybe you could try to switch back to using SysV Shared Memory on AIX
> to see if it helps performance-wise.
> Also, the good news is that there are some restricted tunables in AIX
> that can be tweaked to help different workloads which may have
> different demands.
> One of them is relalias_percentage which works with force_relalias_lite:
> # vmo -h relalias_percentage
> Help for tunable relalias_percentage:
> If force_relalias_lite is set to 0, then this specifies the factor
> used in the heuristic to decide whether to avoid locking the source
> mmapped segment or not.
> Default: 0
> Range: 0 - 32767
> Type: Dynamic
> This is used when tearing down an mmapped region and is a scalability
> statement, where avoiding the lock may help system throughput, but, in
> some cases, at the cost of more compute time used. If the number of
> pages being unmapped is less than this value divided by 100 and
> multiplied by the total number of pages in memory in the source
> mmapped segment, then the source lock will be avoided. A value of 0
> for relalias_percentage, with force_relalias_lite also set to 0, will
> cause the source segment lock to always be taken. Effective values for
> relalias_percentage will vary by workload, however, a suggested value
> is: 200.
> You may also try to play with the munmap_npages vmo tunable.
> Your vmo settings for lgpg_size, lgpg_regions and v_pinshm already
> seem correct.
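The heuristic described in the vmo help text above can be written out as a small predicate. This is only my reading of that help text for the force_relalias_lite = 0 case, not AIX kernel code; the function and parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical restatement of the relalias_percentage rule with
 * force_relalias_lite = 0: the source segment lock is avoided when
 *
 *     pages_unmapped < (relalias_percentage / 100) * pages_in_segment
 *
 * Both sides are multiplied by 100 to stay in integer arithmetic.
 * Overflow for extremely large page counts is ignored in this sketch.
 */
static bool demo_avoid_source_lock(uint64_t pages_unmapped,
                                   uint64_t pages_in_segment,
                                   uint32_t relalias_percentage)
{
    return pages_unmapped * 100 <
           (uint64_t) relalias_percentage * pages_in_segment;
}
```

With relalias_percentage = 0 the right-hand side is zero, so the predicate is never true and the lock is always taken, which matches the help text. With the suggested value of 200, the lock is avoided whenever fewer than twice the segment's in-memory pages are being unmapped.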
> On 24.01.2017 18:08, Konstantin Knizhnik wrote:
>> Hi hackers,
>> Yet another story about AIX. For some reason AIX cleans up zombie
>> processes very slowly.
>> If we launch pgbench with the -C parameter, the limit on the
>> maximum number of connections is exhausted very soon.
>> If the maximum number of connections is set to 1000, then after ten
>> seconds of pgbench activity we get about 900 zombie processes, and it
>> takes about 100 seconds (!)
>> before all of them are terminated.
>> proctree shows a lot of defunct processes:
>> [14:44:41]root(at)postgres:~ # proctree 26084446
>> 26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
>> 4784362 <defunct>
>> 4980786 <defunct>
>> 11403448 <defunct>
>> 11468930 <defunct>
>> 11993176 <defunct>
>> 12189710 <defunct>
>> 12517390 <defunct>
>> 13238374 <defunct>
>> 13565974 <defunct>
>> 13893826 postgres: wal writer process
>> 14024716 <defunct>
>> 15401000 <defunct>
>> 25691556 <defunct>
>> But ps shows that the status of the process is <exiting>:
>> [14:46:02]root(at)postgres:~ # ps -elk | grep 25691556
>> * A - 25691556 - - - - - <exiting>
>> A breakpoint set in the reaper() function in the postmaster shows that each
>> invocation of this function (called by the SIGCHLD handler) processes 5-10
>> PIDs.
>> So there are two hypotheses: either AIX delivers SIGCHLD to the parent
>> very slowly, or process exit takes too much time.
>> The fact that the backends are in the exiting state makes the second hypothesis
>> more plausible.
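For readers unfamiliar with why one SIGCHLD can account for several PIDs: standard signals are not queued, so a handler usually drains every child that has already exited in a loop. A minimal sketch of that pattern follows. It is an illustration, not the postmaster's actual reaper(), and demo_reap_children is a hypothetical name.

```c
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: collect every child that has already exited,
 * without blocking, and return how many were reaped. A child that is
 * still tearing down its address space has not yet become reapable
 * and is simply not returned by waitpid() on this pass.
 */
static int demo_reap_children(void)
{
    int   reaped = 0;
    int   status;
    pid_t pid;

    /* WNOHANG makes waitpid() return 0 when no exited child is ready. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
        reaped++;

    return reaped;
}
```

If each pass collects only a handful of PIDs while hundreds of children are pending, that is consistent with the children being slow to finish exiting rather than with the parent being slow to collect them.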
>> We have tried different Postgres configurations with local and TCP
>> sockets, with different amounts of shared buffers, and built with both
>> gcc and xlc.
>> In all cases the behavior is similar: zombies do not want to die.
>> Since it is not possible to attach a debugger to a defunct process,
>> it is not clear how to find out what's going on.
>> I wonder if somebody has encountered similar problems on AIX and
>> can suggest a solution.
>> Thanks in advance
>> Konstantin Knizhnik
>> Postgres Professional:http://www.postgrespro.com
>> The Russian Postgres Company