Re: Autovacuum seems to block database: WARNING worker took too long to start

From: Pablo Delgado Díaz-Pache <delgadop(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Autovacuum seems to block database: WARNING worker took too long to start
Date: 2010-11-18 12:06:28
Message-ID: AANLkTi=TFc1a37h_RepkeRH9WcqCzn3PP=XuJvRWycu6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Just noticed the mysql database (we just used it for wordpress) that we had
in the same server was also "stucked" and had to restart it again. Checking
the "netstat -an" I noticed the following ...

*# netstat -an | grep 3306*
*tcp 0 0 0.0.0.0:3306 0.0.0.0:*
LISTEN*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33691
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33745
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33710
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33743
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33751
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33690
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33758
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33725
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33739
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33730
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33744
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33688
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33741
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33757
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33712
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33734
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33719
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33738
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33682
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33702
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33701
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33742
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33708
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33703
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33731
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33723
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33695
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33724
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33728
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33689
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33729
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33694
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33720
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33754
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33732
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33705
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33748
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33749
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33733
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33753
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33755
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33726
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33727
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33737
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33693
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33700
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33740
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33747
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33716
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33709
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33697
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33717
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33736
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33715
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33704
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33711
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33746
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33718
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33692
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33707
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33696
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33750
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33681
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33699
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33756
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33714
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33706
SYN_RECV*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49437
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49436
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49439
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49438
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49433
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49432
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49435
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49434
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49429
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49428
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49431
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49430
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49427
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49453
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49452
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49455
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49454
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49449
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49448
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49451
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49450
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49445
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49444
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49447
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49446
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49441
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49440
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49443
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49442
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49457
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49456
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30397
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30396
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30399
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30398
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30393
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30392
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30395
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30394
TIME_WAIT*
*tcp 100 0 10.19.0.47:3306 10.19.0.46:30391
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30409
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30408
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30411
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30410
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30405
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30404
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30407
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30406
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30401
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30400
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30403
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30402
TIME_WAIT*

Didn't do the netstat -an for port 5432, but I guess all max_connections for
postgres were busy. Just a guess ...

All connections are done from internal servers, and there are no firewall
rules in that interface. Just in case here it goes the iptables -L -n -v

*# iptables -L -n -v*
*Chain INPUT (policy ACCEPT 0 packets, 0 bytes)*
* pkts bytes target prot opt in out source
destination*
* 219M 56G mirai all -- * * 0.0.0.0/0
0.0.0.0/0*
*
*
*Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)*
* pkts bytes target prot opt in out source
destination*
*
*
*Chain OUTPUT (policy ACCEPT 170M packets, 392G bytes)*
* pkts bytes target prot opt in out source
destination*
*
*
*Chain mirai (1 references)*
* pkts bytes target prot opt in out source
destination*
* 12M 6207M ACCEPT all -- lo * 0.0.0.0/0
0.0.0.0/0*
* 6186 509K ACCEPT icmp -- * * 0.0.0.0/0
0.0.0.0/0 icmp type 255*
* 205M 50G ACCEPT all -- * * 0.0.0.0/0
0.0.0.0/0 state RELATED,ESTABLISHED*
* 30 1800 ACCEPT all -- eth0 * x.x.x.x 0.0.0.0/0*
*1023K 61M ACCEPT all -- eth1 * 10.0.0.0/8
0.0.0.0/0*
* 6224 373K ACCEPT tcp -- eth0 * 0.0.0.0/0
0.0.0.0/0 tcp multiport dports 80,443*
* 308K 183M REJECT all -- * * 0.0.0.0/0
0.0.0.0/0 reject-with icmp-port-unreachable*

2010/11/18 Pablo Delgado Díaz-Pache <delgadop(at)gmail(dot)com>

> Well, we had the problem again. This is what we did
>
> 1) A "ps" to check what processes were running. Many "postmaster" processes
> in defunct state. A few postgres connections still working. A few "psql" (by
> shell script) queries hanged (they are scripts to monitor state of database.
> We use big-sister to do so).
>
> The normal state of the server is 1 postmaster pid and many "postgres"
> connections in "idle" state.
>
> This the result when we were having the problem. I've excluded irrelevant
> processes.
>
> *postgres 892 3889 0 09:07 ? 00:01:05 postgres: postgres
> international 10.19.0.51(49264) idle*
> *postgres 934 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>*
> *postgres 935 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>*
> *postgres 936 3889 0 12:00 ? 00:00:00 [postmaster] <defunct>*
> *postgres 938 3889 0 12:00 ? 00:00:01 [postmaster] <defunct>*
> *postgres 1244 3889 0 12:01 ? 00:00:01 [postmaster] <defunct>*
> *postgres 1453 3889 0 12:02 ? 00:00:02 [postmaster] <defunct>*
> *postgres 1563 3889 0 12:03 ? 00:00:10 [postmaster] <defunct>*
> *postgres 1862 3889 0 12:04 ? 00:00:04 [postmaster] <defunct>*
> *postgres 2275 3889 0 12:06 ? 00:00:02 [postmaster] <defunct>*
> *root 2287 49 0 Oct29 ? 00:00:00 [kondemand/0]*
> *root 2288 49 0 Oct29 ? 00:00:00 [kondemand/1]*
> *postgres 2485 3889 0 12:07 ? 00:00:05 postgres: postgres
> international 10.19.0.51(59066) idle*
> *postgres 2490 3889 1 12:07 ? 00:00:26 postgres: postgres
> international 10.19.0.51(59070) idle*
> *postgres 2568 3889 0 12:07 ? 00:00:00 postgres: postgres
> international 10.19.0.50(1249) idle*
> *postgres 3544 3889 0 12:12 ? 00:00:07 [postmaster] <defunct>*
> *postgres 3875 3889 0 12:14 ? 00:00:03 [postmaster] <defunct>*
> *postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p
> 5432 -D /var/lib/pgsql/data*
> *postgres 3896 3889 0 Nov16 ? 00:00:00 postgres: logger process*
> *postgres 3913 3889 0 Nov16 ? 00:00:05 postgres: writer process*
> *postgres 3914 3889 0 Nov16 ? 00:00:00 postgres: wal writer
> process*
> *postgres 3915 3889 0 Nov16 ? 00:00:01 postgres: autovacuum
> launcher process*
> *postgres 3916 3889 0 Nov16 ? 00:01:42 postgres: stats collector
> process*
> *postgres 4169 3889 0 12:15 ? 00:00:02 [postmaster] <defunct>*
> *postgres 4170 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>*
> *postgres 4171 3889 0 12:15 ? 00:00:07 [postmaster] <defunct>*
> *postgres 4281 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>*
> *postgres 4672 3889 0 12:17 ? 00:00:01 [postmaster] <defunct>*
> *postgres 5038 3889 0 12:19 ? 00:00:01 [postmaster] <defunct>*
> *postgres 5203 3889 0 12:20 ? 00:00:00 [postmaster] <defunct>*
> *postgres 5484 3889 0 12:21 ? 00:00:01 [postmaster] <defunct>*
> *postgres 5694 3889 0 12:22 ? 00:00:01 postgres: postgres
> international 10.19.0.49(29113) idle*
> *postgres 5769 3889 0 12:23 ? 00:00:04 [postmaster] <defunct>*
> *postgres 5832 3889 0 12:23 ? 00:00:03 postgres: postgres
> international 10.19.0.49(29129) idle*
> *postgres 5835 3889 0 12:23 ? 00:00:02 [postmaster] <defunct>*
> *postgres 5979 3889 0 12:24 ? 00:00:01 [postmaster] <defunct>*
> *postgres 6243 3889 0 12:25 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6424 3889 0 12:26 ? 00:00:01 postgres: postgres
> international 10.19.0.49(45160) idle*
> *postgres 6425 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>*
> *postgres 6429 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>*
> *postgres 6599 3889 0 12:27 ? 00:00:02 [postmaster] <defunct>*
> *postgres 6635 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6653 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6654 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6655 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6663 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6664 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6665 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6666 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6667 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6668 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6669 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6670 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6671 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6672 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6673 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6773 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6774 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6778 3889 0 12:28 ? 00:00:01 [postmaster] <defunct>*
> *postgres 6779 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6785 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6787 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6788 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6796 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6797 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6837 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6838 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6839 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6840 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6841 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6842 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6843 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6844 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6852 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6859 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6860 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6862 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6872 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6873 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6874 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6910 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6970 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6971 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6972 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6973 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6978 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6990 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6991 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6992 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6993 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7000 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7088 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7089 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7090 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7091 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7094 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7095 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7096 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7122 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7126 3889 0 12:29 ? 00:00:03 postgres: postgres mrs
> 10.19.0.51(7717) idle*
> *postgres 7127 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *root 7152 7150 0 12:30 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7157 7155 0 12:30 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7168 7166 0 12:31 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7174 7172 0 12:31 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7188 7186 0 12:32 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7196 7194 0 12:32 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7258 7256 0 12:34 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7263 7261 0 12:34 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7308 7306 0 12:35 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7313 7311 0 12:35 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7325 7323 0 12:36 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7330 7328 0 12:36 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7371 7369 0 12:37 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *postgres 22959 3889 0 11:09 ? 00:00:01 postgres: postgres spain
> 10.19.0.45(57042) idle*
> *postgres 22961 3889 0 11:09 ? 00:00:22 postgres: postgres mirai
> 10.19.0.45(57043) idle*
> *postgres 23225 3889 0 11:10 ? 00:00:00 postgres: postgres xmas
> 10.19.0.45(57061) idle*
> *postgres 25844 3889 0 11:23 ? 00:00:01 postgres: postgres spain
> 10.19.0.51(11291) idle*
> *postgres 31942 3889 0 11:53 ? 00:00:00 [postmaster] <defunct>*
> *postgres 31973 3889 0 11:53 ? 00:00:16 [postmaster] <defunct>*
>
>
> 2) We did a strace to the postmaster pid. However we had 2 postmasters not
> dead
>
> # ps -fea |grep -i postmaster
> postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432
> -D /var/lib/pgsql/data
> postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432
> -D /var/lib/pgsql/data
>
> As soon as we did a "strace" to the 3889 pid everything started to work
> again. Not sure it was a coincidence but that was the way it was.
>
> *# strace -p 3889*
> *Process 3889 attached - interrupt to quit*
> *select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be
> restarted)*
> *--- SIGUSR1 (User defined signal 1) @ 0 (0) ---*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *clone(child_stack=0,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x2b4f6591d440) = 7766*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *rt_sigreturn(0x2) = -1 EINTR (Interrupted system
> call)*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be
> restarted)*
> *--- SIGCHLD (Child exited) @ 0 (0) ---*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766*
> *wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *rt_sigreturn(0x2) = -1 EINTR (Interrupted system
> call)*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be
> restarted)*
> *--- SIGUSR1 (User defined signal 1) @ 0 (0) ---*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *clone(child_stack=0,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x2b4f6591d440) = 7767*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *rt_sigreturn(0x2) = -1 EINTR (Interrupted system
> call)*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be
> restarted)*
> *--- SIGCHLD (Child exited) @ 0 (0) ---*
>
>
> I also straced the other postmaster pid
>
> *# strace -p 7601*
> *Process 7601 attached - interrupt to quit*
> *recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL,
> NULL) = 181*
> *brk(0x14ac4000) = 0x14ac4000*
> *lseek(3, 0, SEEK_END) = 16130048*
> *lseek(5, 0, SEEK_END) = 4661248*
> *lseek(3, 0, SEEK_END) = 16130048*
> *brk(0x14ae9000) = 0x14ae9000*
> *brk(0x14adb000) = 0x14adb000*
> *brk(0x14ad7000) = 0x14ad7000*
> *sendto(7,
> "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"...,
> 408, 0, NULL, 0) = 408*
> *sendto(8,
> "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0,
> NULL, 0) = 74*
> *recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL,
> NULL) = 175*
> *brk(0x14ac4000) = 0x14ac4000*
> *lseek(3, 0, SEEK_END) = 16130048*
> *lseek(5, 0, SEEK_END) = 4661248*
> *lseek(3, 0, SEEK_END) = 16130048*
> *brk(0x14ae9000) = 0x14ae9000*
> *brk(0x14adb000) = 0x14adb000*
> *brk(0x14ad7000) = 0x14ad7000*
> *sendto(8,
> "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0,
> NULL, 0) = 73*
> *recvfrom(8, <unfinished ...>*
> *Process 7601 detached*
>
>
> 3) Checked the logs. We could see the same error lines we had before, but
> they dissapeared after just a few minutes.
>
> *2010-11-18 12:30:02.830
> CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04
> CET|67/0|0|postgres| LOG: unexpected EOF on client connection*
> *2010-11-18 12:30:02.841
> CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28
> CET|75/0|0|postgres| LOG: unexpected EOF on client connection*
> *2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:37:30.462
> CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30
> CET|2/0|0|spainadmin| LOG: unexpected EOF on client connection*
>
>
> Any Ideas?
>
> Thanks for your help and feedback!
>
> Pablo
>
> 2010/11/16 Alvaro Herrera <alvherre(at)commandprompt(dot)com>
>
>> Excerpts from Pablo Delgado Díaz-Pache's message of mar nov 16 12:18:09
>> -0300 2010:
>>
>> > Not sure what you mean.
>> >
>> > Once we start getting that error in the postgres log, there is no
>> autovacuum
>> > entry. Only that error message once every minute (and as i wrote in my
>> last
>> > email, the same message all the time).
>>
>> Yeah. This means that the autovacuum launcher daemon is still running
>> and trying to schedule new jobs, but it's not able to for some reason.
>> When it tries to start a new one, it sees the old one still waiting to
>> start running; and then it logs this message.
>>
>> To start a new job, what it does is send a signal to postmaster and wait
>> for it to create a child process. This is the part that's not working.
>>
>> > The question is what is causing postgres to stop working. Autovacuum or
>> > another reason? If it's another reason ... what could it be?
>>
>> I have no idea what's going on. I have never seen this problem. If you
>> see it happen again, please get a stacktrace from postmaster; and/or
>> strace it.
>>
>> --
>> Álvaro Herrera <alvherre(at)commandprompt(dot)com>
>> The PostgreSQL Company - Command Prompt, Inc.
>> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>>
>
>

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Alvaro Herrera 2010-11-18 14:40:53 Re: Autovacuum seems to block database: WARNING worker took too long to start
Previous Message Pablo Delgado Díaz-Pache 2010-11-18 11:57:16 Re: Autovacuum seems to block database: WARNING worker took too long to start