From: | Pablo Delgado Díaz-Pache <delgadop(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-admin <pgsql-admin(at)postgresql(dot)org> |
Subject: | Re: Autovacuum seems to block database: WARNING worker took too long to start |
Date: | 2010-11-18 12:06:28 |
Message-ID: | AANLkTi=TFc1a37h_RepkeRH9WcqCzn3PP=XuJvRWycu6@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-admin |
Just noticed the mysql database (we just used it for wordpress) that we had
in the same server was also "stucked" and had to restart it again. Checking
the "netstat -an" I noticed the following ...
*# netstat -an | grep 3306*
*tcp 0 0 0.0.0.0:3306 0.0.0.0:*
LISTEN*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33691
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33745
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33710
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33743
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33751
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33690
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33758
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33725
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33739
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33730
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33744
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33688
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33741
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33757
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33712
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33734
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33719
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33738
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33682
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33702
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33701
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33742
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33708
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33703
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33731
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33723
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33695
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33724
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33728
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33689
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33729
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33694
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33720
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33754
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33732
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33705
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33748
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33749
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33733
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33753
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33755
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33726
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33727
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33737
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33693
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33700
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33740
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33747
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33716
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33709
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33697
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33717
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33736
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33715
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33704
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33711
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33746
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33718
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33692
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33707
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33696
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33750
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33681
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33699
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33756
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33714
SYN_RECV*
*tcp 0 0 10.19.0.47:3306 10.19.0.46:33706
SYN_RECV*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49437
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49436
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49439
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49438
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49433
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49432
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49435
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49434
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49429
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49428
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49431
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49430
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49427
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49453
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49452
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49455
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49454
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49449
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49448
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49451
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49450
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49445
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49444
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49447
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49446
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49441
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49440
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49443
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49442
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49457
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:49456
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30397
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30396
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30399
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30398
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30393
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30392
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30395
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30394
TIME_WAIT*
*tcp 100 0 10.19.0.47:3306 10.19.0.46:30391
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30409
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30408
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30411
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30410
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30405
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30404
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30407
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30406
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30401
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30400
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30403
TIME_WAIT*
*tcp 1 0 10.19.0.47:3306 10.19.0.46:30402
TIME_WAIT*
Didn't do the netstat -an for port 5432, but I guess all max_connections for
postgres were busy. Just a guess ...
All connections are done from internal servers, and there are no firewall
rules in that interface. Just in case here it goes the iptables -L -n -v
*# iptables -L -n -v*
*Chain INPUT (policy ACCEPT 0 packets, 0 bytes)*
* pkts bytes target prot opt in out source
destination*
* 219M 56G mirai all -- * * 0.0.0.0/0
0.0.0.0/0*
*
*
*Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)*
* pkts bytes target prot opt in out source
destination*
*
*
*Chain OUTPUT (policy ACCEPT 170M packets, 392G bytes)*
* pkts bytes target prot opt in out source
destination*
*
*
*Chain mirai (1 references)*
* pkts bytes target prot opt in out source
destination*
* 12M 6207M ACCEPT all -- lo * 0.0.0.0/0
0.0.0.0/0*
* 6186 509K ACCEPT icmp -- * * 0.0.0.0/0
0.0.0.0/0 icmp type 255*
* 205M 50G ACCEPT all -- * * 0.0.0.0/0
0.0.0.0/0 state RELATED,ESTABLISHED*
* 30 1800 ACCEPT all -- eth0 * x.x.x.x 0.0.0.0/0*
*1023K 61M ACCEPT all -- eth1 * 10.0.0.0/8
0.0.0.0/0*
* 6224 373K ACCEPT tcp -- eth0 * 0.0.0.0/0
0.0.0.0/0 tcp multiport dports 80,443*
* 308K 183M REJECT all -- * * 0.0.0.0/0
0.0.0.0/0 reject-with icmp-port-unreachable*
2010/11/18 Pablo Delgado Díaz-Pache <delgadop(at)gmail(dot)com>
> Well, we had the problem again. This is what we did
>
> 1) A "ps" to check what processes were running. Many "postmaster" processes
> in defunct state. A few postgres connections still working. A few "psql" (by
> shell script) queries hanged (they are scripts to monitor state of database.
> We use big-sister to do so).
>
> The normal state of the server is 1 postmaster pid and many "postgres"
> connections in "idle" state.
>
> This the result when we were having the problem. I've excluded irrelevant
> processes.
>
> *postgres 892 3889 0 09:07 ? 00:01:05 postgres: postgres
> international 10.19.0.51(49264) idle*
> *postgres 934 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>*
> *postgres 935 3889 0 12:00 ? 00:00:04 [postmaster] <defunct>*
> *postgres 936 3889 0 12:00 ? 00:00:00 [postmaster] <defunct>*
> *postgres 938 3889 0 12:00 ? 00:00:01 [postmaster] <defunct>*
> *postgres 1244 3889 0 12:01 ? 00:00:01 [postmaster] <defunct>*
> *postgres 1453 3889 0 12:02 ? 00:00:02 [postmaster] <defunct>*
> *postgres 1563 3889 0 12:03 ? 00:00:10 [postmaster] <defunct>*
> *postgres 1862 3889 0 12:04 ? 00:00:04 [postmaster] <defunct>*
> *postgres 2275 3889 0 12:06 ? 00:00:02 [postmaster] <defunct>*
> *root 2287 49 0 Oct29 ? 00:00:00 [kondemand/0]*
> *root 2288 49 0 Oct29 ? 00:00:00 [kondemand/1]*
> *postgres 2485 3889 0 12:07 ? 00:00:05 postgres: postgres
> international 10.19.0.51(59066) idle*
> *postgres 2490 3889 1 12:07 ? 00:00:26 postgres: postgres
> international 10.19.0.51(59070) idle*
> *postgres 2568 3889 0 12:07 ? 00:00:00 postgres: postgres
> international 10.19.0.50(1249) idle*
> *postgres 3544 3889 0 12:12 ? 00:00:07 [postmaster] <defunct>*
> *postgres 3875 3889 0 12:14 ? 00:00:03 [postmaster] <defunct>*
> *postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p
> 5432 -D /var/lib/pgsql/data*
> *postgres 3896 3889 0 Nov16 ? 00:00:00 postgres: logger process*
> *postgres 3913 3889 0 Nov16 ? 00:00:05 postgres: writer process*
> *postgres 3914 3889 0 Nov16 ? 00:00:00 postgres: wal writer
> process*
> *postgres 3915 3889 0 Nov16 ? 00:00:01 postgres: autovacuum
> launcher process*
> *postgres 3916 3889 0 Nov16 ? 00:01:42 postgres: stats collector
> process*
> *postgres 4169 3889 0 12:15 ? 00:00:02 [postmaster] <defunct>*
> *postgres 4170 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>*
> *postgres 4171 3889 0 12:15 ? 00:00:07 [postmaster] <defunct>*
> *postgres 4281 3889 0 12:15 ? 00:00:04 [postmaster] <defunct>*
> *postgres 4672 3889 0 12:17 ? 00:00:01 [postmaster] <defunct>*
> *postgres 5038 3889 0 12:19 ? 00:00:01 [postmaster] <defunct>*
> *postgres 5203 3889 0 12:20 ? 00:00:00 [postmaster] <defunct>*
> *postgres 5484 3889 0 12:21 ? 00:00:01 [postmaster] <defunct>*
> *postgres 5694 3889 0 12:22 ? 00:00:01 postgres: postgres
> international 10.19.0.49(29113) idle*
> *postgres 5769 3889 0 12:23 ? 00:00:04 [postmaster] <defunct>*
> *postgres 5832 3889 0 12:23 ? 00:00:03 postgres: postgres
> international 10.19.0.49(29129) idle*
> *postgres 5835 3889 0 12:23 ? 00:00:02 [postmaster] <defunct>*
> *postgres 5979 3889 0 12:24 ? 00:00:01 [postmaster] <defunct>*
> *postgres 6243 3889 0 12:25 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6424 3889 0 12:26 ? 00:00:01 postgres: postgres
> international 10.19.0.49(45160) idle*
> *postgres 6425 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>*
> *postgres 6429 3889 0 12:26 ? 00:00:01 [postmaster] <defunct>*
> *postgres 6599 3889 0 12:27 ? 00:00:02 [postmaster] <defunct>*
> *postgres 6635 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6653 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6654 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6655 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6663 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6664 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6665 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6666 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6667 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6668 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6669 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6670 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6671 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6672 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6673 3889 0 12:27 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6773 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6774 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6778 3889 0 12:28 ? 00:00:01 [postmaster] <defunct>*
> *postgres 6779 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6785 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6787 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6788 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6796 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6797 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6837 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6838 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6839 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6840 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6841 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6842 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6843 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6844 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6852 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6859 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6860 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6862 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6872 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6873 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6874 3889 0 12:28 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6910 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6970 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6971 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6972 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6973 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6978 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6990 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6991 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6992 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 6993 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7000 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7088 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7089 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7090 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7091 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7094 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7095 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7096 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7122 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *postgres 7126 3889 0 12:29 ? 00:00:03 postgres: postgres mrs
> 10.19.0.51(7717) idle*
> *postgres 7127 3889 0 12:29 ? 00:00:00 [postmaster] <defunct>*
> *root 7152 7150 0 12:30 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7157 7155 0 12:30 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7168 7166 0 12:31 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7174 7172 0 12:31 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7188 7186 0 12:32 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7196 7194 0 12:32 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7258 7256 0 12:34 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7263 7261 0 12:34 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7308 7306 0 12:35 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7313 7311 0 12:35 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7325 7323 0 12:36 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7330 7328 0 12:36 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *root 7371 7369 0 12:37 ? 00:00:00 /usr/bin/psql -h
> ddbb.priv -U bigsister mrs*
> *postgres 22959 3889 0 11:09 ? 00:00:01 postgres: postgres spain
> 10.19.0.45(57042) idle*
> *postgres 22961 3889 0 11:09 ? 00:00:22 postgres: postgres mirai
> 10.19.0.45(57043) idle*
> *postgres 23225 3889 0 11:10 ? 00:00:00 postgres: postgres xmas
> 10.19.0.45(57061) idle*
> *postgres 25844 3889 0 11:23 ? 00:00:01 postgres: postgres spain
> 10.19.0.51(11291) idle*
> *postgres 31942 3889 0 11:53 ? 00:00:00 [postmaster] <defunct>*
> *postgres 31973 3889 0 11:53 ? 00:00:16 [postmaster] <defunct>*
>
>
> 2) We did a strace to the postmaster pid. However we had 2 postmasters not
> dead
>
> # ps -fea |grep -i postmaster
> postgres 3889 1 0 Nov16 ? 00:01:24 /usr/bin/postmaster -p 5432
> -D /var/lib/pgsql/data
> postgres 7601 3889 0 12:37 ? 00:00:00 /usr/bin/postmaster -p 5432
> -D /var/lib/pgsql/data
>
> As soon as we did a "strace" to the 3889 pid everything started to work
> again. Not sure it was a coincidence but that was the way it was.
>
> *# strace -p 3889*
> *Process 3889 attached - interrupt to quit*
> *select(6, [3 4 5], NULL, NULL, {56, 930000}) = ? ERESTARTNOHAND (To be
> restarted)*
> *--- SIGUSR1 (User defined signal 1) @ 0 (0) ---*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *clone(child_stack=0,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x2b4f6591d440) = 7766*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *rt_sigreturn(0x2) = -1 EINTR (Interrupted system
> call)*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be
> restarted)*
> *--- SIGCHLD (Child exited) @ 0 (0) ---*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7766*
> *wait4(-1, 0x7fffc37259a4, WNOHANG, NULL) = 0*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *rt_sigreturn(0x2) = -1 EINTR (Interrupted system
> call)*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be
> restarted)*
> *--- SIGUSR1 (User defined signal 1) @ 0 (0) ---*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *clone(child_stack=0,
> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x2b4f6591d440) = 7767*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *rt_sigreturn(0x2) = -1 EINTR (Interrupted system
> call)*
> *rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], NULL, 8) = 0*
> *rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0*
> *select(6, [3 4 5], NULL, NULL, {60, 0}) = ? ERESTARTNOHAND (To be
> restarted)*
> *--- SIGCHLD (Child exited) @ 0 (0) ---*
>
>
> I also straced the other postmaster pid
>
> *# strace -p 7601*
> *Process 7601 attached - interrupt to quit*
> *recvfrom(8, "P\0\0\0\221\0select id_key from transla"..., 8192, 0, NULL,
> NULL) = 181*
> *brk(0x14ac4000) = 0x14ac4000*
> *lseek(3, 0, SEEK_END) = 16130048*
> *lseek(5, 0, SEEK_END) = 4661248*
> *lseek(3, 0, SEEK_END) = 16130048*
> *brk(0x14ae9000) = 0x14ae9000*
> *brk(0x14adb000) = 0x14adb000*
> *brk(0x14ad7000) = 0x14ad7000*
> *sendto(7,
> "\2\0\0\0\230\1\0\0H\203\314\0\4\0\0\0\5\0\0\0\0\0\0\0X\203\314\0\0\0\0\0"...,
> 408, 0, NULL, 0) = 408*
> *sendto(8,
> "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 74, 0,
> NULL, 0) = 74*
> *recvfrom(8, "P\0\0\0\213\0select id_key from transla"..., 8192, 0, NULL,
> NULL) = 175*
> *brk(0x14ac4000) = 0x14ac4000*
> *lseek(3, 0, SEEK_END) = 16130048*
> *lseek(5, 0, SEEK_END) = 4661248*
> *lseek(3, 0, SEEK_END) = 16130048*
> *brk(0x14ae9000) = 0x14ae9000*
> *brk(0x14adb000) = 0x14adb000*
> *brk(0x14ad7000) = 0x14ad7000*
> *sendto(8,
> "1\0\0\0\0042\0\0\0\4T\0\0\0\37\0\1id_key\0\0\314\203X\0\4\0\0"..., 73, 0,
> NULL, 0) = 73*
> *recvfrom(8, <unfinished ...>*
> *Process 7601 detached*
>
>
> 3) Checked the logs. We could see the same error lines we had before, but
> they dissapeared after just a few minutes.
>
> *2010-11-18 12:30:02.830
> CET|1|mirai|10.19.0.43(4452)|3875|idle|4ce50a7c.f23|2010-11-18 12:14:04
> CET|67/0|0|postgres| LOG: unexpected EOF on client connection*
> *2010-11-18 12:30:02.841
> CET|1|mirai|10.19.0.43(13943)|2275|idle|4ce508b4.8e3|2010-11-18 12:06:28
> CET|75/0|0|postgres| LOG: unexpected EOF on client connection*
> *2010-11-18 12:30:47.628 CET|2|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:31:48.056 CET|3|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:32:48.184 CET|4|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:33:48.614 CET|5|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:34:48.747 CET|6|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:35:49.277 CET|7|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:36:49.706 CET|8|||3915||4ce2d008.f4b|2010-11-16 19:40:08
> CET||0|| WARNING: worker took too long to start; cancelled*
> *2010-11-18 12:37:30.462
> CET|1|spain|10.19.0.48(11276)|7414|idle|4ce50ffa.1cf6|2010-11-18 12:37:30
> CET|2/0|0|spainadmin| LOG: unexpected EOF on client connection*
>
>
> Any Ideas?
>
> Thanks for your help and feedback!
>
> Pablo
>
> 2010/11/16 Alvaro Herrera <alvherre(at)commandprompt(dot)com>
>
>> Excerpts from Pablo Delgado Díaz-Pache's message of mar nov 16 12:18:09
>> -0300 2010:
>>
>> > Not sure what you mean.
>> >
>> > Once we start getting that error in the postgres log, there is no
>> autovacuum
>> > entry. Only that error message once every minute (and as i wrote in my
>> last
>> > email, the same message all the time).
>>
>> Yeah. This means that the autovacuum launcher daemon is still running
>> and trying to schedule new jobs, but it's not able to for some reason.
>> When it tries to start a new one, it sees the old one still waiting to
>> start running; and then it logs this message.
>>
>> To start a new job, what it does is send a signal to postmaster and wait
>> for it to create a child process. This is the part that's not working.
>>
>> > The question is what is causing postgres to stop working. Autovacuum or
>> > another reason? If it's another reason ... what could it be?
>>
>> I have no idea what's going on. I have never seen this problem. If you
>> see it happen again, please get a stacktrace from postmaster; and/or
>> strace it.
>>
>> --
>> Álvaro Herrera <alvherre(at)commandprompt(dot)com>
>> The PostgreSQL Company - Command Prompt, Inc.
>> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>>
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2010-11-18 14:40:53 | Re: Autovacuum seems to block database: WARNING worker took too long to start |
Previous Message | Pablo Delgado Díaz-Pache | 2010-11-18 11:57:16 | Re: Autovacuum seems to block database: WARNING worker took too long to start |