| From: | "Marc Mitchell" <marcm(at)eisolution(dot)com> | 
|---|---|
| To: | <pgsql-admin(at)postgresql(dot)org> | 
| Subject: | Re: Row data corruption under 7.3.5 | 
| Date: | 2004-03-17 16:12:47 | 
| Message-ID: | 001f01c40c3a$b1a95e80$6701050a@MarcM8500 | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-admin | 
This is follow-up to a problem first reported on 3/1/04.  The problem
has continued to occur intermittently and recently we experienced the
first occurrence where the first column of a table was the column where
the corrupted and thus we could not recover it.
Google groups searching have found numerous hits for people reporting
the same symptoms.  While we've seen some instructions to get things
back, we've seen nothing about correcting the root cause.
This is becoming a major production problem and starting to cast doubt
on the "Postgres in Production" decision.
We've observed nothing that would lead us to believe there are any
hardware problems.  Initially we were using write-caching using
battery-backed up cache but we turned that off and are using direct I/O
and still experiencing the same problem.  Furthermore, the fact that the
problems seem isolated to 3 specific tables in a 50+ table database
makes us weary of hardware-level issues.
As far as matching up correct and corrupted rows, here's more detail on
a recent occurrence:
[root(at)cin1 backups]# /usr/local/pgsql/7.3.5/bin/pg_dump -Ft -p 5432 -U
postgres solo > solo.dmp
pg_dump: ERROR:  MemoryContextAlloc: invalid request size 4294967293
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table
"freight_track_detail" failed: PQendcopy() failed.
pg_dump: Error message from server: pg_dump: The command was: COPY
public.freight_track_detail (ftd_uid, ftm_uid, txl_uid, ref_nbr_1,
ref_nbr_2, fg_tab_uid, fg_tab_alias, ftd_status_code, scan_timestamp,
add_userid, add_timestamp, mod_userid, mod_timestamp) TO stdout;
..<snip>..
118171  512     2159            00004300854908405208    46366   FGI
2004-01-21 12:25:00     postgres        2004-01-21 15:39:29     OSD
2004-01-22 01:04:40
118153  512     2159            00004304730000071106    46990   FGI
2004-01-21 12:20:00     post    2000-01-01 00:00:00
..End of output.
The second row shows the vchar userid getting lopped off after the first
4 characters.  Note that we've experienced this problem with several
different vchar-typed columns though, as mentioned before, we have
recently seen corruption of integer typed columns.
If we issued an update setting that column plus the subsequent 3 columns
to "null", everything then was back to normal.  This row was right in
the middle of the table.
Furthermore, we recently found problems reported in the same table from
nightly vacuums.  See the following cron-generated emails that contain
error messages as well as datetimes to show the temporal relationship
between these problems:
| From | Date | Subject | |
|---|---|---|---|
| Next Message | scott.marlowe | 2004-03-17 16:51:20 | Re: Row data corruption under 7.3.5 | 
| Previous Message | κΉλν | 2004-03-17 12:34:48 | what the initial 3 daemon of postmaster process do? |