Database corruption

From: Waldo Nell <pwnell(at)telkomsa(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Database corruption
Date: 2006-09-01 15:02:24
Message-ID: D15255F3-FB3D-4AA4-8145-65B8DD83BCA0@telkomsa.net
Lists: pgsql-general

We have recently upgraded from PostgreSQL 7.4.5 to 8.1.4. Our DB is
about 45GB in size and has about 100 tables, lots of stored
procedures, triggers etc.

The DB lives on a SAN and is accessed via Fibre Channel cards in an
IBM BladeCenter. The system is running Red Hat Enterprise Linux 3, I
think. We connect to it using the latest JDBC driver from Java 1.5.

Here is the problem. Last night I saw that one of the statements in
my code threw this exception:

net.za.pwnconsulting.dblayer.query.DBQueryLayerException: Could not
execute SQL query [select c.mc_ca_id as c_mc_ca_id, d.mc_ca_id as
d_mc_ca_id, b.ca_id, b.ca_ctr_id from contract a, contract_agreement
b left outer join monthend_contract c on c.mc_ca_id = ? and
c.mc_last_proc_date = ? left outer join monthend2_contract d on
d.mc_ca_id = ? and d.mc_last_proc_date = ? where a.ctr_id =
b.ca_ctr_id and b.ca_agreement_type = ? and b.ca_start_date > ? and
b.ca_start_date <= ? and a.ctr_id = ? and ca_level_company = ? and
ca_level_data = ? and ca_level_type = 'V' and ca_id <> ? limit 1] in
pool [Oasis] because: ERROR: out of memory
Detail: Failed on request of size 20.
org.postgresql.util.PSQLException: ERROR: out of memory
Detail: Failed on request of size 20.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:1525)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1309)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:188)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:452)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:340)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:239)
    at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInternal(StandardQueryLayer.java:289)
    at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInternalExceptionWrapper(StandardQueryLayer.java:158)
    at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase(StandardQueryLayer.java:68)
    at net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase(StandardQueryLayer.java:58)
    at net.za.pwnconsulting.dblayer.AbstractDBLayer.queryDatabase(AbstractDBLayer.java:57)
    at za.co.massmart.oasis.daemons.ContractExtensionTimerTask.safeExecuteInternal(ContractExtensionTimerTask.java:183)
    at za.co.massmart.oasis.daemons.ExtendedHACronTimerTask.safeExecute(ExtendedHACronTimerTask.java:94)
    at net.za.pwnconsulting.javaconfig.utils.time.HACronTimerTask.execute(HACronTimerTask.java:156)
    at net.za.pwnconsulting.javaconfig.utils.time.CronTimerTask.run(CronTimerTask.java:103)
    at java.lang.Thread.run(Thread.java:595)
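
In case the server's memory configuration is relevant here, the
settings can be inspected with plain SQL; these are standard
PostgreSQL parameters (I have not changed them much from the 8.1
defaults, as far as I know):

```sql
-- Standard PostgreSQL settings that bound shared and per-backend memory.
-- An "out of memory ... Failed on request of size N" error usually means a
-- single backend hit a process-level allocation limit, not that shared
-- buffers were exhausted.
SHOW shared_buffers;
SHOW work_mem;
SHOW maintenance_work_mem;
SHOW max_connections;
```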

So I checked, and saw that another application connecting to a
different DB in the same instance showed a very strange error. For an
"update tablename set field1 = X where primary_field = Y" statement
it got a duplicate key exception on the serial field "primary_field"
- which should be impossible, since I am not changing that value; I
am updating a different column. VACUUM FULL ANALYZE did not fix it. I
dropped and recreated the index, and the error went away. This was
just after I restarted the postgres instance (I had stopped it with
"pg_ctl stop -m immediate" since there were connections from other
systems that I could not close).
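
For reference, the repair was essentially the following (the table,
column, and index names here are placeholders, not our real schema;
REINDEX rebuilds the index from the heap data, which amounts to the
same thing as the drop-and-recreate I did):

```sql
-- Look for rows that violate the supposedly-unique constraint;
-- with a corrupted index, duplicates can hide behind it.
SELECT primary_field, count(*)
FROM tablename
GROUP BY primary_field
HAVING count(*) > 1;

-- Rebuild the index from the table's heap data.
REINDEX INDEX tablename_pkey;
```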

I restarted the application that encountered the first error above
and ran the task again - it worked. So I then started a VACUUM FULL
ANALYZE on the main production DB. It produced several warnings -
something about page sizes not being large enough, I think - but it
completed successfully.
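
I did not keep the exact text of those warnings; if the wording would
help, the run can be repeated with verbose output to capture them:

```sql
-- Same maintenance pass, but VERBOSE echoes per-table details and any
-- warnings to the client session as well as the server log.
VACUUM FULL VERBOSE ANALYZE;
```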

This morning we found that some data generated at 01:00 was present,
but most of the tables had lost all data captured since 25 August
2006. That amounts to about 500 contracts, a whole billing cycle,
etc. We upgraded PostgreSQL on the 20th.

I checked the file system and it yielded no errors (ext3). I checked
the server and there were no obvious hardware issues.

How can part of the DB just be lost like that? I have just verified -
by restoring a backup made yesterday evening at 20:00, before the
errors shown above happened - that the restored DB is also missing
the data. The bill run happened yesterday afternoon, so somewhere in
between something broke. Any ideas on how to troubleshoot this?
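
The verification against the restored backup was essentially of this
form (the table and column names below are simplified placeholders
for our schema):

```sql
-- Newest row visible in the restored copy; in both the live DB and the
-- 20:00 backup this stops around 25 August rather than at the bill run.
SELECT max(ctr_capture_date) FROM contract;

-- Rough row count for comparison against what the bill run produced.
SELECT count(*) FROM contract WHERE ctr_capture_date >= '2006-08-25';
```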

To summarise - the DB lost about 6 days of transactions for no known
reason.
