Re: SIGUSR1 pingpong between master na autovacum launcher causes crash

From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: SIGUSR1 pingpong between master na autovacum launcher causes crash
Date: 2009-08-24 11:47:27
Message-ID: 1251114447.3252.16.camel@localhost
Lists: pgsql-hackers


On Sat, 22 Aug 2009 at 09:56 -0400, Tom Lane wrote:
> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> > There are most important records from yesterdays issues.
> > Messages:
> > ---------
> > Aug 20 11:14:54 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 507 (postgres)
>
> Hmm, that seems to confirm the idea that something had run the machine
> out of memory/swap space, which would explain the repeated ENOMEM fork
> failures. But we're still no closer to understanding how come the
> delay in the avlauncher didn't do what it was supposed to.

I found a hungry process that eats up all the memory, and fortunately
it is not postgres :-).

I also ran the following dtrace script:

dtrace -n 'syscall::kill:entry /execname=="postgres"/ {
    printf("%i %s, %i->%i : %i", timestamp, execname, pid, arg0, arg1); }'

and it shows the following (slightly modified) output:

<snip>
CPU Timestamp[ns] diff[ms] caller callee sig
0 2750745000052090 899,96 28604 -> 28608 16
3 2750745100280460 100,23 28608 -> 28604 16
1 2750746000144690 899,86 28604 -> 28608 16
3 2750746100380940 100,24 28608 -> 28604 16
2 2750747000135380 899,75 28604 -> 28608 16
3 2750747100171650 100,04 28608 -> 28604 16
0 2750748000101050 899,93 28604 -> 28608 16
3 2750748100331900 100,23 28608 -> 28604 16
1 2750749000148550 899,82 28604 -> 28608 16
3 2750749100386640 100,24 28608 -> 28604 16
2 2750750000095040 899,71 28604 -> 28608 16
3 2750750100127780 100,03 28608 -> 28604 16

We can see there that the AVlauncher really does wait 100ms, but that
is not enough when the system is under stress.
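
For anyone checking the diff[ms] column, it is just the difference
between consecutive nanosecond timestamps divided by 10^6. A minimal
sketch in C (the function name elapsed_ms is made up for this
illustration and is not part of the dtrace script):

```c
#include <stdint.h>

/*
 * Milliseconds elapsed between two nanosecond timestamps, such as the
 * values printed by dtrace's "timestamp" variable.
 * elapsed_ms is a hypothetical helper name used only in this sketch.
 */
static double
elapsed_ms(uint64_t earlier_ns, uint64_t later_ns)
{
    return (double) (later_ns - earlier_ns) / 1e6;
}
```

Applied to the first two rows of the trace,
elapsed_ms(2750745000052090, 2750745100280460) gives roughly 100.23,
which matches the second row of the diff[ms] column.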

I tested Alvaro's patch and it works, because it no longer leads to
stack consumption, but it exposes another bug in the
StartAutovacuumWorker() code. When fork fails, the bn structure is
freed, but ReleasePostmasterChildSlot() should be called as well. See
the error:

2009-08-24 11:50:20.360 CEST 3468 FATAL: no free slots in PMChildFlags array

and the comment in the source code:

/* Out of slots ... should never happen, else postmaster.c messed up */
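
To illustrate why the missing release ends in that FATAL, here is a toy
model of a fixed-size slot array. It is only a sketch and not the
PostgreSQL code: NUM_SLOTS, assign_slot(), release_slot() and
count_starts_without_slot() are names invented for this example, and
fork is simply assumed to fail on every attempt.

```c
#include <stdbool.h>

#define NUM_SLOTS 4     /* hypothetical array size, for illustration only */

static bool slot_used[NUM_SLOTS];

/* Claim a free slot; returns a 1-based slot number, or 0 if none is free. */
static int
assign_slot(void)
{
    for (int i = 0; i < NUM_SLOTS; i++)
    {
        if (!slot_used[i])
        {
            slot_used[i] = true;
            return i + 1;
        }
    }
    return 0;
}

/* Give a previously assigned slot back. */
static void
release_slot(int slot)
{
    slot_used[slot - 1] = false;
}

/*
 * Simulate "attempts" worker starts in which fork always fails.
 * Returns how many attempts found no free slot at all.
 */
static int
count_starts_without_slot(int attempts, bool release_on_failure)
{
    int exhausted = 0;

    for (int i = 0; i < NUM_SLOTS; i++)
        slot_used[i] = false;

    for (int i = 0; i < attempts; i++)
    {
        int slot = assign_slot();

        if (slot == 0)
        {
            exhausted++;        /* the "no free slots" situation */
            continue;
        }

        /* Assume fork failed here (e.g. ENOMEM); clean up if asked to. */
        if (release_on_failure)
            release_slot(slot);
    }
    return exhausted;
}
```

With four slots and ten failed starts,
count_starts_without_slot(10, false) reports six attempts that found no
free slot, while count_starts_without_slot(10, true) reports none.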

I think that Alvaro's patch is good and that it fixes the crash
problem. I also think that the AVlauncher could wait a little longer
than 100ms. When the system cannot fork, I don't see any reason to
hurry to repeat the fork operation. Maybe 1s is a good compromise.

Zdenek
