Perl script failure => Postgres 7.1.2 database corruption

From: Frank McKenney <frank_mckenney(at)mindspring(dot)com>
To: "PostgreSQL Bug List" <pgsql-bugs(at)postgresql(dot)org>
Cc: Ethan Burnside <support(at)kattare(dot)com>
Subject: Perl script failure => Postgres 7.1.2 database corruption
Date: 2001-11-08 23:09:59
Message-ID: 200111090409.XAA00267@maynard.mail.mindspring.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs


Severity
--------

Major annoyance (having to rebuild a database)... unless there is a
simple workaround I was too ignorant to apply (;-).

Problem
-------

A few days ago, for no obvious reason, a Perl script used to update
the database portion of a 'web site suddenly stopped working.
Several hours later it became apparent that "idle" Postgres
processes were queuing up on the Postgres server machine. Still
later it became obvious that at least one table in the database was
damaged to the point that even pg_dump would not dump it properly.

The 'web site uses Perl CGI (Apache mod_perl) and DBD::Pg to access
Postgres database on a separate server.

Symptoms
--------

- Backed-up hung ("idle") Postgres processes.
- Undeletable table in database (apparently damaged).
- pg_dump failure.
- psql errors accessing table.
- Perl DBI errors reported in Apache/mod_perl error log.

Background
----------

The Perl script involved here updates the Postgres database from
Win32 .DBF files using DBD:Pg and DBD::Xbase. The script has run
successfully for several weeks, and in fact ran successfully using
the same set of DBF files last Friday.

In order to minimize effect of the update on users who may be using
the database, the script performs the entire database update as a
single transaction. In Perl::DBI terms, it issues a

$obj = DBI->connect($db_datasource,
$db_username,
$db_password,
{AutoCommit => 0});

and when the updates have been added without error (which usually
happens) it issues a

$obj->commit();
$obj->disconnect();
$obj = undef;

This time the update died partway through:

+ Marking old records for deletion.
(At this point the script was in the process of issuing
DELETE FROM TABLE xxxxx statements)
DBD::Pg::db do failed: pqReadData() -- backend closed the
channel unexpectedly. at ./update-foundation-db.pl line 244.

Could not purge old records from summary, stopping at
./update-foundation-db.pl line 244.

What triggered the first error is still not completely clear, but it
may have been the result of the account exceeding its disk space
quota. However, even after some 40Mb of files were deleted and the
quota was increased by 50Mb, the script continued to fail at the
same point.

The only mildly bright spot in all this is that none of this
appeared to affect 'web site users. SQL queries from site users
continued to yield results throughout (well, up to the point where I
deleted the database (;-)).

Other messages reported by the script included:

DBD::Pg::db do failed: ERROR: heap_delete: (am)invalid tid at
./update-foundation-db.pl line 244.

DBD::Pg::db do failed: ERROR: simple_heap_delete: tuple
concurrently updated at ./update-foundation-db.pl line 244.

DBD::Pg::db do failed: ERROR: Relation 1632370768 does not
exist at ./update-foundation-db.pl line 244.

DBD::Pg::db do failed: ERROR: Relation 419450449 does not exist
at ./update-foundation-db.pl line 244.

At the same time, "idle" Postgres processes were backing up on the
Postgres server. These only cleared when the Postgres server
process was restarted.
-------
10626 ? S 0:00 postgres: websiteadmin websiteadmin
206.163.128.1 idle in transaction
(and 8 more like it at one point)
-------

Postgres access from the 'web site logged errors to the Apache error
log for the site's virtual host:
-----------
NOTICE: Message from PostgreSQL backend:
The Postmaster has informed me that some other backend died
abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am going to
terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
[Tue Nov 6 20:52:21 2001] null: Database handle destroyed
without explicit disconnect at
/usr/lib/perl5/site_perl/Apache/DBI.pm line 119.
-----------

A number of the 'null:' errors were reported. A similar NOTICE:
preceeded the following errors as well:

-----------
[Tue Nov 6 20:52:53 2001] [error] FATAL 2: XLogFlush: request
is not satisfied at
/home1/g/websiteadmin/site/modules//StandardUtil.pm line
48.

[Tue Nov 6 20:52:54 2001] null: DBD::Pg::db rollback failed:
rollback failed at /usr/lib/perl5/site_perl/Apache/DBI.pm
line 153.
-----------

I attempted to dump the database with pg_dump, and failed miserably.
The results convinced me that the database was _seriously_ mangled.
-----------
pg_dump -c -Fc -h postgres.sitehost.com -u websiteadmin >db.pgd
Username: websiteadmin
Password:
ERROR: Relation 419450449 does not exist
PQendcopy: resetting connection
SQL query to dump the contents of Table 'summary' did not
execute correctly. After we read all the table contents from
the backend, PQendcopy() failed. Explanation from backend:
'ERROR: Relation 419450449 does not exist '.
The query was: 'COPY "summary" TO stdout;'.

psql -h postgres.kattare.com -U websiteadmin -W
Password:
Welcome to psql, the PostgreSQL interactive terminal.
--snip--
websiteadmin=> vacuum verbose summary;
NOTICE: --Relation summary--
NOTICE: Pages 728: Changed 34, reaped 604, Empty 0, New 0; Tup
1032: Vac 5094, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 96,
MaxLen 2030; Re-using: Free/Avail. Space 4913484/4913484;
EndEmpty/Avail. Pages 0/604. CPU 0.03s/0.00u sec.
NOTICE: Index summary_pkey: Pages 31; Tuples 1032: Deleted
5094. CPU 0.00s/0.06u sec.
NOTICE: Rel summary: Pages: 728 --> 118; Tuple(s) moved: 1032.
CPU 0.11s/0.16u sec.
NOTICE: Index summary_pkey: Pages 31; Tuples 1032: Deleted
1032. CPU 0.00s/0.01u sec.
NOTICE: --Relation pg_toast_20037--
NOTICE: Rel pg_toast_20037: TID 1444/14: OID IS INVALID.
TUPGONE 1.
NOTICE: Rel pg_toast_20037: TID 1444/19: OID IS INVALID.
TUPGONE 1.
-- Many, MANY similar lines snipped --
NOTICE: Rel pg_toast_20037: TID 1444/2008: OID IS INVALID.
TUPGONE 1.
NOTICE: Rel pg_toast_20037: TID 1444/2016: OID IS INVALID.
TUPGONE 1.
pqReadData() -- backend closed the channel unexpectedly. This
probably means the backend terminated abnormally before or while
processing the request.
The connection to the server was lost.
Attempting reset: Failed. !> -> \q

Workaround
----------

Due to time constraints, and because I had the data available, I
deleted the database with 'dropdb -h', re-created it using
'createdb', and re-populated it using the now-functioning Perl
script. Another choice might have been to restore the database from
a backup, assuming that the database corruption actually occurred in
conjunction with the initial script failure and not some time
earlier.

Environment
-----------

DB Server: Postgres 7.1.2, Slackware Linux 8.?
SELECT version() reports PostgreSQL 7.1.2 on
i686-pc-linux-gnu, compiled by GCC egcs-2.91.66

DB Client: Apache 1.3.20, Perl 5.6.1, mod_perl, Slackware 8.1

Questions
---------

1) Are there circumstances under which a "space exceeded" error on
a client machine _can_ damage a database on the Postgres server
machine?

2) The Postgres error messages all indicate that the transaction
was going to be "rolled back", but to all indications (including
the hung processes (;-)) it was not. Is this a bug? Or am I
misreading the message?

3) What other things could we have tried to recover this situation?

Could I have ignored Postgres' claim that it was rolling back
the transaction, re-connected to the database (with psql, say),
and issued a ROLLBACK myself?

4) Where would I find information that would help me accurately
estimate the client's disk space requirements for SQL database
processing on a Postgres server?

Frank McKenney

Frank McKenney, McKenney Associates
Richmond, Virginia / (804) 320-4887
E-mail: frank_mckenney(at)mindspring(dot)com

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Stephan Szabo 2001-11-09 03:57:19 Re: Bug in 7.1.3 and 7.2?
Previous Message Stephen Pillinger 2001-11-08 16:04:55 PRNG not seeded => Seg.Fault