PostgreSQL hanging on new connections?

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: PostgreSQL hanging on new connections?
Date: 2010-11-29 17:18:40
Message-ID: 20101129171839.GA24895@depesz.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

hi,
got (another time, it's not the first thing) this situation:
1. server: 64gb of ram, 24 cores of xeon processors (hyperthreading is off), storage is fusion io.
2. at one point load skyrocketed to 48 (i know, it's not high with this
hardware)
3. we were not able to connect to PostgreSQL from psql - psql when run
just froze.
4. postgresql: PostgreSQL 8.3.12 on x86_64-redhat-linux-gnu, compiled by
GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48)

What I did gather:
straced postmaster when the problem was happening, and I was opening new
connections. strace looks like this:

16:54:36.886720 select(6, [3 4 5], NULL, NULL, {54, 366000}) = 1 (in [4], left {51, 563000})
16:54:39.768318 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
16:54:39.770795 accept(4, {sa_family=AF_INET, sin_port=htons(59661), sin_addr=inet_addr("10.50.50.7")}, [16]) = 8
16:54:39.773895 getsockname(8, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.60.0.10")}, [16]) = 0
16:54:39.774522 setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0
16:54:39.774579 setsockopt(8, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
16:54:39.774729 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b06874a4460) = 12891
16:54:39.854208 close(8) = 0
16:54:39.855800 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
16:54:39.856853 select(6, [3 4 5], NULL, NULL, {60, 0}) = 1 (in [4], left {57, 842000})
16:54:42.018570 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
16:54:42.027079 accept(4, {sa_family=AF_INET, sin_port=htons(59662), sin_addr=inet_addr("10.50.50.7")}, [16]) = 8
16:54:42.046511 getsockname(8, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.60.0.10")}, [16]) = 0
16:54:42.095917 setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0
16:54:42.109190 setsockopt(8, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
16:54:42.111603 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b06874a4460) = 12912
16:54:42.114379 close(8) = 0
16:54:42.120551 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
16:54:42.128922 select(6, [3 4 5], NULL, NULL, {60, 0}) = 1 (in [4], left {58, 593000})

so, everything looks right.

but - I was also stracing backends - just after creation (I did strace -ff -p <postmaster.pid> -o x.log).

strace from single backend, just forked off:

16:54:39.852638 close(3) = 0
16:54:39.852823 close(4) = 0
16:54:39.852872 close(5) = 0
16:54:39.852910 close(6) = 0
16:54:39.852994 setsid() = 12891
16:54:39.853050 rt_sigaction(SIGTERM, {0x5ae490, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {0x588580, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, 8) = 0
16:54:39.853169 rt_sigaction(SIGQUIT, {0x5ae490, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {0x588580, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, 8) = 0
16:54:39.853215 rt_sigaction(SIGALRM, {0x5ae490, [], SA_RESTORER, 0x3bd66302d0}, {0x1, [], SA_RESTORER, 0x3bd66302d0}, 8) = 0
16:54:39.853259 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE SEGV ALRM TERM CONT SYS RTMIN RT_1], NULL, 8) = 0
16:54:39.853504 write(2, "\0\0^\0[2\0\0t2010-11-29 16:54:39 UTC [12891]: [1-1] LOG: connection received: host=10.50.50.7 port=59661\n", 103) = 103
16:54:39.853622 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={60, 0}}, NULL) = 0
16:54:39.853689 recvfrom(8, "\0\0\0\10\4\322\26/", 8192, 0, NULL, NULL) = 8
16:54:39.853822 sendto(8, "N", 1, 0, NULL, 0) = 1
16:54:39.853901 recvfrom(8, "\0\0\0%\0\3\0\0user\0web\0database\0xxxxxxxxx\0\0", 8192, 0, NULL, NULL) = 37
16:54:39.856007 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
16:54:39.856059 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
16:54:39.856130 write(2, "\0\0r\0[2\0\0t2010-11-29 16:54:39 UTC 10.50.50.7(59661) [12891]: [2-1] LOG: connection authorized: user=web database=xxxxxxxxx\n", 123) = 123
16:54:39.856255 rt_sigaction(SIGHUP, {0x5ad340, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {0x588a90, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, 8) = 0
16:54:39.856315 rt_sigaction(SIGINT, {0x5ae350, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {0x588580, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, 8) = 0
16:54:39.856359 rt_sigaction(SIGTERM, {0x5ae3f0, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {0x5ae490, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, 8) = 0
16:54:39.856402 rt_sigaction(SIGQUIT, {0x5ae4a0, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {0x5ae490, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, 8) = 0
16:54:39.856445 rt_sigaction(SIGALRM, {0x5a6160, [], SA_RESTORER, 0x3bd66302d0}, {0x5ae490, [], SA_RESTORER, 0x3bd66302d0}, 8) = 0
16:54:39.856484 rt_sigaction(SIGPIPE, {0x1, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {0x1, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, 8) = 0
16:54:39.856526 rt_sigaction(SIGUSR1, {0x5a1740, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {0x5888b0, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, 8) = 0
16:54:39.856565 rt_sigaction(SIGUSR2, {0x4d7030, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {0x584a50, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, 8) = 0
16:54:39.856819 rt_sigaction(SIGFPE, {0x5ad7b0, [], SA_RESTORER|SA_RESTART, 0x3bd66302d0}, {SIG_DFL, [], 0}, 8) = 0
16:54:39.856929 rt_sigaction(SIGCHLD, {SIG_DFL, [], SA_RESTORER|SA_RESTART|SA_NOCLDSTOP, 0x3bd66302d0}, {0x588050, [], SA_RESTORER|SA_RESTART|SA_NOCLDSTOP, 0x3bd66302d0}, 8) = 0
16:54:39.856978 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
16:54:39.857043 mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b0796de9000
16:54:39.944561 semctl(7602213, 5, SETVAL, 0) = 0
16:54:39.944644 brk(0x3fbe000) = 0x3fbe000
16:54:39.944694 open("global/pg_database", O_RDONLY) = 3
16:54:39.944758 fstat(3, {st_mode=S_IFREG|0600, st_size=270, ...}) = 0
16:54:39.944811 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b0796fea000
16:54:39.944843 read(3, "\"template1\" 1 1663 456490577\n\"template0\" 11510 1663 527228657\n\"postgres\" 11511 1663 506289897\n\"xxxxxxxxx\" 16387 1663 457831935\n\"yyyy\" 31851 1663 427191013\n\"aaaa\" 31853 1663 427191021\n\"bbbb\" 31854 1663 427191048\n\"cccc\" 31855 1663 427191069\n\"ddddd\" 1477300 1"..., 4096) = 270
16:54:39.944921 close(3) = 0
16:54:39.944954 munmap(0x2b0796fea000, 4096) = 0
16:54:39.945088 semop(7602213, 0x7fffceae7110, 1

and that's all - backend was not showing anything else.

All other backends that were created at this time were doing the same thing - waiting on this semop.

What's more - when I tried to stop pg (pg_ctl -m fast), all backends died except for one.

This single backend was connected from very remote location, with not really good network connectivity between it and postgresql server.

I did strace this single backend, but then I had to pg_ctl -m immediate.

strace showed:

23540 sendto(8, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 838, 0, NULL, 0) = ? ERESTARTSYS (To be restarted)
23540 --- SIGQUIT (Quit) @ 0 (0) ---
23540 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
23540 write(2, "\0\0\34\2\364[\0\0t2010-11-29 16:58:22 UTC AA.BB.CC.DD(60835) [23540]: [623-1] WARNING: terminating connection because of crash of another server process\n2010-11-29 16:58:22 UTC AA.BB.CC.DD(60835) [23540]: [624-1] DETAIL: The postmaster has commanded this server p"..., 549) = 549
23540 write(2, "\0\0\204\2\364[\0\0t2010-11-29 16:58:22 UTC AA.BB.CC.DD(60835) [23540]: [626-1] FATAL: terminating connection due to administrator command\n2010-11-29 16:58:22 UTC AA.BB.CC.DD(60835) [23540]: [627-1] STATEMENT: SELECT table.id, table.some_f1, table.some_f2, "..., 653) = 653
23540 sendto(7, "\1\0\0\0X\2\0\0\3(at)\0\0\6\0\0\0\1\0\0\0\1\0\0\0p(dot)-\0\0\0\0\0\1\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\212I\0\0\377\177\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0>\200\0\0\377\177\0\0\6\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 600, 0, NULL, 0) = 600
23540 --- SIGQUIT (Quit) @ 0 (0) ---
23540 rt_sigprocmask(SIG_SETMASK, ~[QUIT ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
23540 write(2, "\0\0\34\2\364[\0\0t2010-11-29 16:58:22 UTC AA.BB.CC.DD(60835) [23540]: [628-1] WARNING: terminating connection because of crash of another server process\n2010-11-29 16:58:22 UTC AA.BB.CC.DD(60835) [23540]: [629-1] DETAIL: The postmaster has commanded this server p"..., 549) = 549
23540 exit_group(2) = ?

Now, the question is: why did it hang? Is there anything we can do to
make it *not* hang?

Best regards,

depesz

--
Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/
jid/gtalk: depesz(at)depesz(dot)com / aim:depeszhdl / skype:depesz_hdl / gg:6749007

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2010-11-29 18:23:00 Re: PostgreSQL hanging on new connections?
Previous Message Raymond O'Donnell 2010-11-29 17:02:30 Re: Question about OID and TCID