Re: PGError: server closed the connection unexpectedly

From: Dave Steinberg <dave-pgsql(at)redterror(dot)net>
To: Doug McNaught <doug(at)mcnaught(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PGError: server closed the connection unexpectedly
Date: 2006-01-07 22:16:56
Message-ID: 43C03DD8.2010604@redterror.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

>>My biggest problem is the lack of any real error message on the server.
>> I don't see anything wrong in the system logs, and there's no core
>>file in the /var/postgresql directory.
>
>
> Are you sure core files are enabled; i.e. the server is running with
> 'ulimit -c unlimited' ?

Yes:

$ whoami
_postgresql
$ ulimit -c
unlimited

>> I did a 'vacuumdb -afz' just as
>>a shot in the dark, without affect. Pretty much all I see in the logs
>>is this:
>>
>><USER%DB x.y.z.a(51478) 487>LOG: unexpected EOF on client connection
>
>
> This means a client is dying or closing its connection prematurely,
> and would seem to be a different problem. It shouldn't ever cause the
> server to crash.

If the server is crashing - I have absolutely no info to suggest that,
except possibly this: at a later time I did a tcpdump of one of these
sessions, for lack of better ideas, and saw this near the end:

<snip lots of tcpdump output>
16:32:19.523842 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: P 17389:18036(647) ack 9323 win
16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.524289 morningsun.int.geekisp.com.postgresql >
clam.int.geekisp.com.15245: P 9323:9343(20) ack 18036 win 17376
<nop,nop,timestamp 3702263885 133396667> (DF)
16:32:19.562544 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: P 18036:18041(5) ack 9343 win
16384 <nop,nop,timestamp 133396667 3702263885> (DF)
******16:32:19.566662 morningsun.int.geekisp.com.postgresql >
clam.int.geekisp.com.15245: F 9343:9343(0) ack 18041 win 17376
<nop,nop,timestamp 3702263885 133396667> (DF)
16:32:19.567043 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: . ack 9344 win 16384
<nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.649445 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: P 18041:18851(810) ack 9344 win
16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.649468 morningsun.int.geekisp.com.postgresql >
clam.int.geekisp.com.15245: R 545360451:545360451(0) win 0 (DF)
16:32:19.649652 clam.int.geekisp.com.15245 >
morningsun.int.geekisp.com.postgresql: F 18851:18851(0) ack 9344 win
16384 <nop,nop,timestamp 133396667 3702263885> (DF)
16:32:19.649665 morningsun.int.geekisp.com.postgresql >
clam.int.geekisp.com.15245: R 545360451:545360451(0) win 0 (DF)

The '*****' above is my own, to highlight the interesting part.
Morningsun is the server, clam is the ruby client. Based on this, I
think the server *might* be dying first (hard to tell), since morningsun
is the one who sends the FIN packet and thereby closes the connection.
(Sorry for the horrible wrapping).

Again, having trouble making sure.

>>Googling turned up a few reports suggesting bad hardware, or corrupted
>>indexes, but I don't think that's the case here.
>>
>>Any starting points or ideas would be greatly appreciated.
>
>
> Make sure the server is able to dump a core file, and perhaps crank up
> the logging level.

Just to verify that a core dump is possible for the postgresql user, I
wrote a tiny C program that just calls abort(). Sure enough, running it
gives me a core file.

I cranked the debug level up to 5 (man that's a lot), and here's what I
think is the relevent chunk:

<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>STATEMENT: UPDATE
job_members SET "start_time" = NULL, "exit_code" = NULL, "create_time" =
'2006-01-07 17:10:31', "job_id" = 30, "command" = '---
!ruby/object:GeekISP::ShellCommand
args:
command: /bin/ls /tmp/
exit_code:
logger: !ruby/object:Logger
datetime_format:
level: 0
logdev: !ruby/object:Logger::LogDevice
dev: !ruby/object:File {}
filename:
/mnt/scratch/dave/control_panel/trunk/config/../log/test.log
shift_age: 0
shift_size: 1048576
progname:
run_as:
stderr:
stdin:
stdout: ', "finish_time" = NULL, "last_update_time" = NULL,
"host_id" = 4, "status" = 'in_progress', "output" = NULL WHERE
job_member_id = 105
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG: name:
unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid:
10290432/1/0, nestlvl: 1, children: <>
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>STATEMENT: UPDATE
job_members SET "start_time" = NULL, "exit_code" = NULL, "create_time" =
'2006-01-07 17:10:31', "job_id" = 30, "command" = '---
!ruby/object:GeekISP::ShellCommand
args:
command: /bin/ls /tmp/
exit_code:
logger: !ruby/object:Logger
datetime_format:
level: 0
logdev: !ruby/object:Logger::LogDevice
dev: !ruby/object:File {}
filename:
/mnt/scratch/dave/control_panel/trunk/config/../log/test.log
shift_age: 0
shift_size: 1048576
progname:
run_as:
stderr:
stdin:
stdout: ', "finish_time" = NULL, "last_update_time" = NULL,
"host_id" = 4, "status" = 'in_progress', "output" = NULL WHERE
job_member_id = 105
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG: proc_exit(0)
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG: shmem_exit(0)
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>LOG: disconnection:
session time: 0:00:00.84 user=geekispv2 database=geekisp-v2-test
host=192.168.4.38 port=31992
<geekispv2%geekisp-v2-test 192.168.4.38(31992) 9155>DEBUG: exit(0)
<% 7303>DEBUG: reaping dead processes
<% 7303>DEBUG: server process (PID 9155) exited with exit code 0

That looks to me like a clean and normal exit. This is pointing more
and more towards the client in the ruby case, isn't it?

Regards,
--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Angshu Kar 2006-01-07 22:27:19 programming in pgsql
Previous Message Angshu Kar 2006-01-07 22:07:03 Re: COPY to