Re: Autovacuum seems to block database: WARNING worker took too long to start

From: Pablo Delgado Díaz-Pache <delgadop(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Autovacuum seems to block database: WARNING worker took too long to start
Date: 2010-11-18 11:57:16
Message-ID: AANLkTinGUsJ31au72wZEbQeWkXbcCGd6MsugMnPQ0hX7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Well, we had the problem again. This is what we did

1) A "ps" to check what processes were running. Many "postmaster" processes
in defunct state. A few postgres connections still working. A few "psql" (by
shell script) queries hanged (they are scripts to monitor state of database.
We use big-sister to do so).

The normal state of the server is 1 postmaster pid and many "postgres"
connections in "idle" state.

This the result when we were having the problem. I've excluded irrelevant
processes.

*postgres 892 3889 0 09:07 ? 00:01:05 postgres: postgres
international 10.19.0.51(49264) idle*
*postgres 934 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>*
*postgres 935 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>*
*postgres 936 3889 0 12:00 ? 00:00:00 [postmaster] <defunct>*
*postgres 938 3889 0 12:00 ? 00:00:01 [postmaster] <defunct>*
*postgres 1244 3889 0 12:01 ? 00:00:01 [postmaster] <defunct>*
*postgres 1453 3889 0 12:02 ? 00:00:02 [postmaster] <defunct>*
*postgres 1563 3889 0 12:03 ? 00:00:10 [postmaster] <defunct>*
*postgres 1862 3889 0 12:04 ? 00:00:04 [postmaster] <defunct>*
*postgres 2275 3889 0 12:06 ? 00:00:02 [postmaster] <defunct>*
*root 2287 49 0 Oct29 ? 00:00:00 [kondemand/0]*
*root 2288 49 0 Oct29 ? 00:00:00 [kondemand/1]*
*postgres 2485 3889 0 12:07 ? 00:00:05 postgres: postgres
international 10.19.0.51(59066) idle*
*postgres 2490 3889 1 12:07 ? 00:00:26 postgres: postgres
international 10.19.0.51(59070) idle*
*postgres 2568 3889 0 12:07 ? 00:00:00 postgres: postgres
international 10.19.0.50(1249) idle*
*postgres 3544 3889 0 12:12 ? 00:00:07 [postmaster] <defunct>*
*postgres 3875 3889 0 12:14 ? 00:00:03 [postmaster] <defunct>*
*postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432
-D /var/lib/pgsql/data*
*postgres 3896 3889 0 Nov16 ? 00:00:00 postgres: logger process*
*postgres 3913 3889 0 Nov16 ? 00:00:05 postgres: writer process*
*postgres 3914 3889 0 Nov16 ? 00:00:00 postgres: wal writer
process*
*postgres 3915 3889 0 Nov16 ? 00:00:01 postgres: autovacuum
launcher process*
*postgres 3916 3889 0 Nov16 ? 00:01:42 postgres: stats collector
process*
*postgres 4169 3889 0 12:15 ? 00:00:02 [postmaster] <defunct>*
*postgres 4170 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>*
*postgres 4171 3889 0 12:15 ? 00:00:07 [postmaster] <defunct>*
*postgres 4281 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>*
*postgres 4672 3889 0 12:17 ? 00:00:01 [postmaster] <defunct>*
*postgres 5038 3889 0 12:19 ? 00:00:01 [postmaster] <defunct>*
*postgres 5203 3889 0 12:20 ? 00:00:00 [postmaster] <defunct>*
*postgres 5484 3889 0 12:21 ? 00:00:01 [postmaster] <defunct>*
*postgres 5694 3889 0 12:22 ? 00:00:01 postgres: postgres
international 10.19.0.49(29113) idle*
*postgres 5769 3889 0 12:23 ? 00:00:04 [postmaster] <defunct>*
*postgres 5832 3889 0 12:23 ? 00:00:03 postgres: postgres
international 10.19.0.49(29129) idle*
*postgres 5835 3889 0 12:23 ? 00:00:02 [postmaster] <defunct>*
*postgres 5979 3889 0 12:24 ? 00:00:01 [postmaster] <defunct>*
*postgres 6243 3889 0 12:25 ? 00:00:00 [postmaster] <defunct>*
*postgres 6424 3889 0 12:26 ? 00:00:01 postgres: postgres
international 10.19.0.49(45160) idle*
*postgres 6425 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>*
*postgres 6429 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>*
*postgres 6599 3889 0 12:27 ? 00:00:02 [postmaster] <defunct>*
*postgres 6635 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6653 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6654 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6655 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6663 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6664 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6665 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6666 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6667 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6668 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6669 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6670 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6671 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6672 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6673 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
*postgres 6773 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6774 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6778 3889 0 12:28 ? 00:00:01 [postmaster] <defunct>*
*postgres 6779 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6785 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6787 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6788 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6796 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6797 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6837 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6838 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6839 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6840 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6841 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6842 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6843 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6844 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6852 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6859 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6860 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6862 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6872 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6873 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6874 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
*postgres 6910 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 6970 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 6971 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 6972 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 6973 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 6978 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 6990 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 6991 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 6992 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 6993 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7000 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7088 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7089 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7090 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7091 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7094 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7095 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7096 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7122 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*postgres 7126 3889 0 12:29 ? 00:00:03 postgres: postgres mrs
10.19.0.51(7717) idle*
*postgres 7127 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
*root 7152 7150 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7157 7155 0 12:30 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7168 7166 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7174 7172 0 12:31 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7188 7186 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7196 7194 0 12:32 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7258 7256 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7263 7261 0 12:34 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7308 7306 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7313 7311 0 12:35 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7325 7323 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7330 7328 0 12:36 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*root 7371 7369 0 12:37 ? 00:00:00 /usr/bin/psql -h ddbb.priv
-U bigsister mrs*
*postgres 22959 3889 0 11:09 ? 00:00:01 postgres: postgres spain
10.19.0.45(57042) idle*
*postgres 22961 3889 0 11:09 ? 00:00:22 postgres: postgres mirai
10.19.0.45(57043) idle*
*postgres 23225 3889 0 11:10 ? 00:00:00 postgres: postgres xmas
10.19.0.45(57061) idle*
*postgres 25844 3889 0 11:23 ? 00:00:01 postgres: postgres spain
10.19.0.51(11291) idle*
*postgres 31942 3889 0 11:53 ? 00:00:00 [postmaster] <defunct>*
*postgres 31973 3889 0 11:53 ? 00:00:16 [postmaster] <defunct>*

2) We did a strace to the postmaster pid. However we had 2 postmasters not
dead

# ps -fea |grep -i postmaster
postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432
-D /var/lib/pgsql/data
postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432
-D /var/lib/pgsql/data

As soon as we did a "strace" to the 3889 pid everything started to work
again. Not sure it was a coincidence but that was the way it was.

*# strace -p 3889*
*Process 3889 attached - interrupt to quit*
*select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be
restarted)*
*--- SIGUSR1 (User defined signal 1) @ 0 (0) ---*
*rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
RT_1], NULL, 8) = 0*
*clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x2b4f6591d440) = 7766*
*rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
*rt_sigreturn(0x2) = -1 EINTR (Interrupted system
call)*
*rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
RT_1], NULL, 8) = 0*
*rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
*select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be
restarted)*
*--- SIGCHLD (Child exited) @ 0 (0) ---*
*rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
RT_1], NULL, 8) = 0*
*wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766*
*wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0*
*rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
*rt_sigreturn(0x2) = -1 EINTR (Interrupted system
call)*
*rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
RT_1], NULL, 8) = 0*
*rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
*select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be
restarted)*
*--- SIGUSR1 (User defined signal 1) @ 0 (0) ---*
*rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
RT_1], NULL, 8) = 0*
*clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x2b4f6591d440) = 7767*
*rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
*rt_sigreturn(0x2) = -1 EINTR (Interrupted system
call)*
*rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
RT_1], NULL, 8) = 0*
*rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
*select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be
restarted)*
*--- SIGCHLD (Child exited) @ 0 (0) ---*

I also straced the other postmaster pid

*# strace -p 7601*
*Process 7601 attached - interrupt to quit*
*recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL,
NULL) = 181*
*brk(0x14ac4000) = 0x14ac4000*
*lseek(3, 0, SEEK_END) = 16130048*
*lseek(5, 0, SEEK_END) = 4661248*
*lseek(3, 0, SEEK_END) = 16130048*
*brk(0x14ae9000) = 0x14ae9000*
*brk(0x14adb000) = 0x14adb000*
*brk(0x14ad7000) = 0x14ad7000*
*sendto(7,
"\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"...,
408, 0, NULL, 0) = 408*
*sendto(8,
"1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0,
NULL, 0) = 74*
*recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL,
NULL) = 175*
*brk(0x14ac4000) = 0x14ac4000*
*lseek(3, 0, SEEK_END) = 16130048*
*lseek(5, 0, SEEK_END) = 4661248*
*lseek(3, 0, SEEK_END) = 16130048*
*brk(0x14ae9000) = 0x14ae9000*
*brk(0x14adb000) = 0x14adb000*
*brk(0x14ad7000) = 0x14ad7000*
*sendto(8,
"1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0,
NULL, 0) = 73*
*recvfrom(8, <unfinished ...>*
*Process 7601 detached*

3) Checked the logs. We could see the same error lines we had before, but
they dissapeared after just a few minutes.

*2010-11-18 12:30:02.830
CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04
CET|67/0|0|postgres| LOG: unexpected EOF on client connection*
*2010-11-18 12:30:02.841
CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28
CET|75/0|0|postgres| LOG: unexpected EOF on client connection*
*2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08
CET||0|| WARNING: worker took too long to start; cancelled*
*2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08
CET||0|| WARNING: worker took too long to start; cancelled*
*2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08
CET||0|| WARNING: worker took too long to start; cancelled*
*2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08
CET||0|| WARNING: worker took too long to start; cancelled*
*2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08
CET||0|| WARNING: worker took too long to start; cancelled*
*2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08
CET||0|| WARNING: worker took too long to start; cancelled*
*2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08
CET||0|| WARNING: worker took too long to start; cancelled*
*2010-11-18 12:37:30.462
CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30
CET|2/0|0|spainadmin| LOG: unexpected EOF on client connection*

Any Ideas?

Thanks for your help and feedback!

Pablo

2010/11/16 Alvaro Herrera <alvherre(at)commandprompt(dot)com>

> Excerpts from Pablo Delgado Díaz-Pache's message of mar nov 16 12:18:09
> -0300 2010:
> > Not sure what you mean.
> >
> > Once we start getting that error in the postgres log, there is no
> autovacuum
> > entry. Only that error message once every minute (and as i wrote in my
> last
> > email, the same message all the time).
>
> Yeah. This means that the autovacuum launcher daemon is still running
> and trying to schedule new jobs, but it's not able to for some reason.
> When it tries to start a new one, it sees the old one still waiting to
> start running; and then it logs this message.
>
> To start a new job, what it does is send a signal to postmaster and wait
> for it to create a child process. This is the part that's not working.
>
> > The question is what is causing postgres to stop working. Autovacuum or
> > another reason? If it's another reason ... what could it be?
>
> I have no idea what's going on. I have never seen this problem. If you
> see it happen again, please get a stacktrace from postmaster; and/or
> strace it.
>
> --
> Álvaro Herrera <alvherre(at)commandprompt(dot)com>
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Pablo Delgado Díaz-Pache 2010-11-18 12:06:28 Re: Autovacuum seems to block database: WARNING worker took too long to start
Previous Message Tom Lane 2010-11-17 23:29:11 Re: could not map view of backend variables: error code 6