This is a follow-up to a problem first reported on 3/1/04. The problem
has continued to occur intermittently, and we recently experienced the
first occurrence in which the corrupted column was the first column of
a table, so we could not recover it.
Searching Google Groups has turned up numerous hits from people
reporting the same symptoms. While we've seen some instructions for
getting things back, we've seen nothing about correcting the root cause.
This is becoming a major production problem and starting to cast doubt
on the "Postgres in Production" decision.
We've observed nothing that would lead us to believe there are any
hardware problems. Initially we were using write caching with a
battery-backed cache, but we turned that off and switched to direct I/O
and are still experiencing the same problem. Furthermore, the fact that
the problems seem isolated to 3 specific tables in a 50+ table database
makes us skeptical of a hardware-level cause.
As far as matching up correct and corrupted rows, here's more detail on
a recent occurrence:
[root@cin1 backups]# /usr/local/pgsql/7.3.5/bin/pg_dump -Ft -p 5432 -U
postgres solo > solo.dmp
pg_dump: ERROR: MemoryContextAlloc: invalid request size 4294967293
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table
"freight_track_detail" failed: PQendcopy() failed.
pg_dump: Error message from server: pg_dump: The command was: COPY
public.freight_track_detail (ftd_uid, ftm_uid, txl_uid, ref_nbr_1,
ref_nbr_2, fg_tab_uid, fg_tab_alias, ftd_status_code, scan_timestamp,
add_userid, add_timestamp, mod_userid, mod_timestamp) TO stdout;
118171 512 2159 00004300854908405208 46366 FGI
2004-01-21 12:25:00 postgres 2004-01-21 15:39:29 OSD
118153 512 2159 00004304730000071106 46990 FGI
2004-01-21 12:20:00 post 2000-01-01 00:00:00
..End of output.
The second row shows the varchar userid getting lopped off after the
first 4 characters. Note that we've experienced this problem with
several different varchar-typed columns and, as mentioned before, have
recently seen corruption of integer-typed columns as well.
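As a side note on the pg_dump error above, the request size 4294967293
is 2^32 - 3, which is what -3 looks like when a signed 32-bit value is
treated as unsigned. The following is only a small sketch of that
arithmetic; the helper name as_signed_32 is made up for illustration,
and reading this as "a length field picked up from damaged data" is a
guess on our part, not something the server has confirmed.

```python
def as_signed_32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a signed 32-bit integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # Values with the top bit set are negative in two's complement.
    return value - (1 << 32) if value >= (1 << 31) else value


# Size taken from "MemoryContextAlloc: invalid request size 4294967293"
requested = 4294967293
print(as_signed_32(requested))  # prints -3
```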
Once we issued an update setting the truncated userid column plus the
subsequent 3 columns to null, everything was back to normal. This row
was right in the middle of the table.
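For anyone wanting to try the same workaround, here is a rough sketch
that builds the kind of UPDATE we mean. Several things in it are
assumptions read off the dump output rather than confirmed facts: that
the truncated column is add_userid, that the "subsequent 3 columns" are
add_timestamp, mod_userid and mod_timestamp (following the COPY column
order), and that the damaged row can be addressed by ftd_uid = 118153.
The helper build_null_update is hypothetical and only produces the SQL
text; it does not connect to the database.

```python
def build_null_update(table, columns, key_column, key_value):
    """Build an UPDATE that sets the given columns to NULL for one row."""
    assignments = ", ".join(f"{col} = NULL" for col in columns)
    # int() guards against anything but a plain integer key being pasted in.
    return (
        f"UPDATE {table} SET {assignments} "
        f"WHERE {key_column} = {int(key_value)};"
    )


# Assumed column names and key value; adjust to the row actually affected.
stmt = build_null_update(
    "freight_track_detail",
    ["add_userid", "add_timestamp", "mod_userid", "mod_timestamp"],
    "ftd_uid",
    118153,
)
print(stmt)
```

This only works while the key column itself is still readable, which is
why corruption in the first column of a table is so much harder to
deal with.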
Furthermore, we recently found problems reported in the same table from
nightly vacuums. See the following cron-generated emails that contain
error messages as well as datetimes to show the temporal relationship
between these problems: