Bug #798: Data in table causes server to terminate with signal 11

From: pgsql-bugs(at)postgresql(dot)org
To: pgsql-bugs(at)postgresql(dot)org
Subject: Bug #798: Data in table causes server to terminate with signal 11
Date: 2002-10-16 18:14:15
Message-ID: 20021016181415.6E6AF475F25@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Dan Wilder (dan(at)ssc(dot)com) reports a bug with a severity of 4
The lower the number the more severe it is.

Short Description
Data in table causes server to terminate with signal 11

Long Description
I have a table which brings about the error message (from syslog)

server process (pid 21503) was terminated by signal 11

on any attempt to select certain fields from a certain record
in a certain table.

The application has been running since June with no such error
indication. Problem cropped up some time during the night,
possibly following a run of vacuumdb. The dump_db program fails
a few records before the apparent problem record.

I upgraded to a local build of postgresql-7.2.3 and found the
same problem.

Any field in the problem record up to a certain field can be
selected. Any attempt to select the problem field or any
field following it in the record yields the error message.

Any other record in the table can be selected. vacuumdb
from 7.2.3 runs successfully and gives no indication of
anything amiss.

After shutting down the database and saving a snapshot of the
data, I restarted the 7.2.3 version and under psql deleted
the problem record then reinserted it using copies of its values
from an earlier snapshot of the database. All is fine now
with the running copy.

I still have the snapshot of the database files taken prior to the
deletion, have verified that I can start a copy of postmaster
running against this data, and that the problem can be reproduced.

In the mailing list archives I find several references to "signal 11"
with inconclusive resolutions, including hints about problem
hardware. I do not believe hardware is the cause of this difficulty,
because the application ran successfully for so long prior to this
incident, because the problem is so perfectly reproducible,
because the machine in question runs so smoothly otherwise,
and has done many builds of kernels and other software, because the
ECC RAM being used in the machine has completed 30 passes of
memtest86, and because the problem went away when I deleted and
then reinserted the offending record. The table in question
has records that are written once and never updated, and the
problem record was written in June. dump_db runs every night
and has not previously aborted. Since deletion of the offending
record, dump_db has run successfully, twice.

Running Postgresql 7.2 on Debian Linux 3.0 (woody), default
Debian install, when the problem was observed. Immediate clue
to the problem was email from a cron job:

From: root(at)ssc(dot)com (Cron Daemon)
To: subs(at)ssc(dot)com
Subject: Cron <subs(at)wheel> /usr/lib/orcss/admin/dump.sh
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/home/subs>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=subs>
Date: Mon, 14 Oct 2002 22:34:18 -0700 (PDT)

pg_dump: connection not open
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table "tender" failed: PQendcopy()
+failed.
pg_dump: Error message from server: FATAL 1: The database system is starting up
pg_dump: The command was: COPY "tender" TO stdout;

Corresponding syslog entries, wrapped (I hope) for readability:

Oct 14 22:34:17 wheel postgres[505]: [1] DEBUG:
server process (pid 19015) was terminated by signal 11

Oct 14 22:34:17 wheel postgres[505]: [2] DEBUG:
terminating any other active server processes

Oct 14 22:34:17 wheel postgres[505]: [3] DEBUG:
all server processes terminated; reinitializing
shared memory and semaphores

Oct 14 22:34:17 wheel postgres[19039]: [4] DEBUG:
database system was interrupted at 2002-10-14 22:31:39 PDT

Oct 14 22:34:17 wheel postgres[19040]: [4] FATAL 1:
The database system is starting up

Oct 14 22:34:17 wheel postgres[19039]: [5] DEBUG:
checkpoint record is at 4/B5F438B4

Oct 14 22:34:17 wheel postgres[19039]: [6] DEBUG:
redo recordundo record is at 4/B5F438B4;
undo record is at 0/0; shutdown FALSE

Oct 14 22:34:17 wheel postgres[19039]: [7] DEBUG:
next transaction id: 195169;
next oid: 57807563

Oct 14 22:34:17 wheel postgres[19039]: [8] DEBUG:
database system was not properly shut down;
automatic recovery in progress

Oct 14 22:34:17 wheel postgres[19039]: [9] DEBUG:
ReadRecord: record with zero length at 4/B5F438F4

Oct 14 22:34:17 wheel postgres[19039]: [10] DEBUG:
redo is not required

Oct 14 22:34:21 wheel postgres[19039]: [11] DEBUG:
database system is ready

Attempting to run dump_db by hand, twice, yielded the
same result, with the same last record dumped in each case.

Later attempts at queries revealed server crashes
at any attempt to query record with primary key
(tender_id) 12776 involving any field appearing later
in the schema than (including) the problem field.

Typical syslog entries with debug level 4, timestamps
and "DEBUG:" omitted:

connection: host=[local] user=scott database=subs
InitPostgres
StartTransactionCommand
query: select getdatabaseencoding()
ProcessQuery
CommitTransactionCommand
StartTransactionCommand
query: SELECT usesuper FROM pg_user WHERE usename = 'scott'
ProcessQuery
CommitTransactionCommand
StartTransactionCommand
query: select is_refund from tender where tender_id = 12776;
ProcessQuery
CommitTransactionCommand
StartTransactionCommand
query: select card_number from tender where tender_id = 12776;
ProcessQuery
CommitTransactionCommand
StartTransactionCommand
query: select verf_state from tender where tender_id = 12776;
ProcessQuery
reaping dead processes
child process (pid 25363) was terminated by signal 11
server process (pid 25363) was terminated by signal 11
terminating any other active server processes
all server processes terminated;
reinitializing shared memory and semaphores
shmem_exit(0)
BackendStartup: forked pid=25371 socket=9
database system was interrupted at 2002-10-15 15:09:58 PDT
checkpoint record is at 4/D02C89D8
[ etc ]

If there is any interest in this problem and more information
is wanted, I am willing to start up a copy of postmaster against
the saved data for the purpose of gathering more information.
Unfortunately I am not at liberty to provide a complete copy
of the saved data. If anybody knows of a way to obtain data
from the offending record I am willing to do so, and provide
the results.

Sample Code

No file was uploaded with this report

Browse pgsql-bugs by date

  From Date Subject
Next Message Cristóvão Dalla Costa 2002-10-16 20:43:25 bug in create table with serial
Previous Message Dietrich 2002-10-16 13:49:46 Optimization disaster