Explained by known hardware failures, or keep looking?

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-admin(at)postgresql(dot)org>
Subject: Explained by known hardware failures, or keep looking?
Date: 2007-06-18 19:41:59
Message-ID: 467699B7.EE98.0025.0@wicourts.gov
Lists: pgsql-admin

Linux version 2.6.5-7.252-bigsmp (geeko(at)buildhost) (gcc version 3.3.3 (SuSE Linux)) #1 SMP Tue Feb 14 11:11:04 UTC 2006
PostgreSQL 8.2.4 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.3 (SuSE Linux)
Database 212GB on RAID 5.
4 x Intel(R) XEON(TM) MP CPU 2.00GHz

             total       used       free     shared    buffers     cached
Mem:       6213064    5884120     328944          0          0    5135240
-/+ buffers/cache:     748880    5464184
Swap:      6293496       8212    6285284

pg_control version number: 822
Catalog version number: 200611241
Database system identifier: 5005146534765255671
Database cluster state: in production
pg_control last modified: Mon 18 Jun 2007 01:29:57 PM CDT
Current log file ID: 471
Next log file segment: 217
Latest checkpoint location: 1D7/D832669C
Prior checkpoint location: 1D7/D81753D8
Latest checkpoint's REDO location: 1D7/D832669C
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/612647174
Latest checkpoint's NextOID: 2450764
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Time of latest checkpoint: Mon 18 Jun 2007 01:29:57 PM CDT
Minimum recovery ending location: 118/99F15C2C
Maximum data alignment: 4
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

listen_addresses = '*' # what IP address(es) to listen on;
max_connections = 200 # (change requires restart)
shared_buffers = 160MB # min 128kB or max_connections*16kB
temp_buffers = 50MB # min 800kB
work_mem = 10MB # min 64kB
maintenance_work_mem = 160MB # min 1MB
max_fsm_pages = 800000 # min max_fsm_relations*16, 6 bytes each
bgwriter_lru_percent = 20.0 # 0-100% of LRU buffers scanned/round
bgwriter_lru_maxpages = 200 # 0-1000 buffers max written/round
bgwriter_all_percent = 10.0 # 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 600 # 0-1000 buffers max written/round
wal_buffers = 160kB # min 32kB
checkpoint_segments = 10 # in logfile segments, min 1, 16MB each
archive_command = ''
random_page_cost = 2.0 # same scale as above
effective_cache_size = 5GB
redirect_stderr = on # Enable capturing of stderr into log
log_line_prefix = '[%m] %p %q<%u %d %r> ' # Special values:
stats_block_level = on
stats_row_level = on
autovacuum = on # enable autovacuum subprocess?
autovacuum_naptime = 10s # time between autovacuum runs
autovacuum_vacuum_threshold = 1 # min # of tuple updates before
autovacuum_analyze_threshold = 1 # min # of tuple updates before
datestyle = 'iso, mdy'
lc_messages = 'C' # locale for system error message
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
escape_string_warning = off
standard_conforming_strings = on
sql_inheritance = off

First off, we're not looking for help with recovering the data -- we run several copies of this database and have shifted load to work around the failure until we can do a PITR-style recovery from one of the working copies. I'm just trying to determine whether the problems we're seeing are likely to be explained by a known event, or whether we should be casting about for other possible causes.

On May 30th, a database server suffered a failure in its battery-backed RAID controller card. We replaced it and brought the database back up. It seemed to come up clean, and we didn't have any noticeable problems.

On June 14th, the server abruptly lost power. When power was restored, PostgreSQL came back up with these messages:

[2007-06-14 11:31:05.949 CDT] 6781 LOG: database system was interrupted at 2007-06-14 11:18:53 CDT
[2007-06-14 11:31:05.958 CDT] 6781 LOG: checkpoint record is at 1D2/6CB2EFA0
[2007-06-14 11:31:05.958 CDT] 6781 LOG: redo record is at 1D2/6C739064; undo record is at 0/0; shutdown FALSE
[2007-06-14 11:31:05.958 CDT] 6781 LOG: next transaction ID: 0/607370870; next OID: 2336076
[2007-06-14 11:31:05.958 CDT] 6781 LOG: next MultiXactId: 1; next MultiXactOffset: 0
[2007-06-14 11:31:05.958 CDT] 6781 LOG: database system was not properly shut down; automatic recovery in progress
[2007-06-14 11:31:05.986 CDT] 6781 LOG: redo starts at 1D2/6C739064
[2007-06-14 11:31:46.533 CDT] 6781 WARNING: invalid page header in block 182566 of relation "1523860"; zeroing out page
[2007-06-14 11:31:46.533 CDT] 6781 CONTEXT: xlog redo split_r: rel 1663/16386/1523860; tid 182566/92; oth 182563; rgh 115741
[2007-06-14 11:31:56.228 CDT] 6781 WARNING: invalid page header in block 182567 of relation "1523860"; zeroing out page
[2007-06-14 11:31:56.229 CDT] 6781 CONTEXT: xlog redo split_r: rel 1663/16386/1523860; tid 182567/94; oth 182128; rgh 114655
[2007-06-14 11:32:04.964 CDT] 6781 WARNING: invalid page header in block 123644 of relation "1524189"; zeroing out page
[2007-06-14 11:32:04.964 CDT] 6781 CONTEXT: xlog redo split_r: rel 1663/16386/1524189; tid 123644/101; oth 123634; rgh 106665
[2007-06-14 11:32:11.327 CDT] 6781 WARNING: invalid page header in block 356562 of relation "1524219"; zeroing out page
[2007-06-14 11:32:11.327 CDT] 6781 CONTEXT: xlog redo split_r: rel 1663/16386/1524219; tid 356562/58; oth 356549; rgh 34892
[2007-06-14 11:32:14.795 CDT] 6781 LOG: record with zero length at 1D2/70C31890
[2007-06-14 11:32:14.795 CDT] 6781 LOG: redo done at 1D2/70C31868
[2007-06-14 11:32:33.833 CDT] 6781 LOG: database system is ready
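
The relations named only by number in those warnings are relfilenodes. Assuming their pg_class entries survived, something along these lines maps them back to table or index names (relkind 'r' = table, 'i' = index):

-- map the relfilenodes from the recovery warnings to relation names
SELECT relname, relkind
  FROM pg_class
 WHERE relfilenode IN (1523860, 1524189, 1524219);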

Over the weekend it fell apart. The problems occurred right at the end of a weekly maintenance process that builds a new version of a table based on our records retention rules. The replacement is built under a temporary name; then the previous version of the table is dropped and the new table is renamed into place (sketched below). This leaves a fraction of a second during which queries may fail on the missing table, but it seems better than the alternatives. (If a query doesn't complete within 20 seconds, it is an error for the users of these tables, since our web app times out. The alternatives that blocked rather than erroring outright held queries up for more than 20.25 seconds, so this approach generates the fewest errors from a user perspective.)
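
For concreteness, the swap amounts to something like the sketch below. The temporary-table and constraint names come from the log further down; the retention-rule SELECT and the key columns are made-up placeholders, and the real job presumably also recreates the other indexes and grants:

-- build the replacement under a temporary name; the WHERE clause is a
-- hypothetical stand-in for the actual retention rules
CREATE TABLE "HiddenCaseTT" AS
  SELECT * FROM "HiddenCase"
   WHERE "dateLastAccessed" > now() - interval '7 years';

-- placeholder key columns; the constraint name matches the NOTICE below
ALTER TABLE "HiddenCaseTT"
  ADD CONSTRAINT "HiddenCase_pkey" PRIMARY KEY ("caseNo", "countyNo");

-- the swap itself: two fast exclusive-lock operations, so a query arriving
-- in between fails on a missing table rather than blocking past our
-- 20-second timeout
DROP TABLE "HiddenCase";
ALTER TABLE "HiddenCaseTT" RENAME TO "HiddenCase";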

[2007-06-17 08:22:41.925 CDT] 10674 <postgres bigbird [local]> NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "HiddenCase_pkey" for table "HiddenCaseTT"
[2007-06-17 08:24:10.511 CDT] 6868 LOG: checkpoints are occurring too frequently (20 seconds apart)
[2007-06-17 08:24:10.511 CDT] 6868 HINT: Consider increasing the configuration parameter "checkpoint_segments".
[2007-06-17 17:05:26.707 CDT] 11860 <bigbird bigbird 127.0.0.1(33210)> ERROR: invalid page header in block 148284 of relation "RcptTender"
[2007-06-17 17:05:26.707 CDT] 11860 <bigbird bigbird 127.0.0.1(33210)> STATEMENT: SELECT "amtPaid", "rcptNo", "tenderCode", "tenderSeqNo", "ccAuthNo", "ccExp", "ccNo", "ccType", "comments", "countyNo" FROM "RcptTender" WHERE ("countyNo"=37) AND ("rcptNo" >= '07R 012287' AND ("rcptNo" > '07R 012287' OR "tenderSeqNo" >= 1)) AND ("rcptNo" <= '07R 012783' AND ("rcptNo" < '07R 012783' OR "tenderSeqNo" <= 1)) ORDER BY "rcptNo", "tenderSeqNo", "countyNo"
[2007-06-17 17:05:33.346 CDT] 11854 <bigbird bigbird 127.0.0.1(33204)> ERROR: duplicate key violates unique constraint "DocImageMetaData_pkey"
[2007-06-17 17:05:33.347 CDT] 11854 <bigbird bigbird 127.0.0.1(33204)> STATEMENT: INSERT INTO "DocImageMetaData" ("isConfidential", "docId", "insertedDate", "docSource", "sectionNo", "countyNo", "eFilingRcvdDate", "caseNo", "docHash", "isEFiling", "imageSize", "filedReceived", "isSignatProcessed", "insertedBy", "isDuplex", "newDocId", "docName", "histSeqNo", "orientation", "isSignatRequired", "eFilingFiledDate", "paperSize", "isHidden", "pages", "dateLastAccessed") VALUES ($1, $2, $3, $4, $5, $6, NULL, $7, $8, $9, $10, $11, $12, $13, $14, NULL, $15, $16, $17, $18, NULL, $19, $20, $21, $22)
[2007-06-17 17:05:34.070 CDT] 6779 LOG: server process (PID 11854) was terminated by signal 11
[2007-06-17 17:05:34.070 CDT] 6779 LOG: terminating any other active server processes

We are now getting "invalid page header" messages on at least five different tables and at least 10 different indexes. We're seeing signal 11 from time to time. We're seeing the database, while still running, report that it is starting up or that it is in recovery mode for periods of time. We're also seeing a variety of strange messages indicating corruption at startup -- I can post more detail if someone thinks it would help diagnose the issue.
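
For what it's worth, our working assumption is that where only an index has bad pages, REINDEX can rebuild it from the heap, and a forced full scan will flush out any remaining bad heap pages, since in 8.2 an unqualified count(*) always reads the whole table -- for example:

-- rebuild all indexes on a damaged table (assumes its heap is sound)
REINDEX TABLE "RcptTender";

-- any heap page with a bad header will raise
-- "invalid page header in block N", pinpointing the damage
SELECT count(*) FROM "RcptTender";

(There is also the zero_damaged_pages escape hatch for reading past bad pages, but since we'll recover from a good copy we have no reason to reach for it.)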

The most recent start included these lines:

[2007-06-18 13:24:37.490 CDT] 11981 LOG: redo starts at 1D7/D7DC8D8C
[2007-06-18 13:24:37.526 CDT] 11981 LOG: unexpected pageaddr 1D7/C1FE8000 in log file 471, segment 215, offset 16678912
[2007-06-18 13:24:37.526 CDT] 11981 LOG: redo done at 1D7/D7FE6730
[2007-06-18 13:24:37.895 CDT] 11981 LOG: database system is ready
[2007-06-18 13:24:57.380 CDT] 12034 <bigbird bigbird 127.0.0.1(51781)> ERROR: could not access status of transaction 10
[2007-06-18 13:24:57.380 CDT] 12034 <bigbird bigbird 127.0.0.1(51781)> DETAIL: Could not open file "pg_clog/0000": No such file or directory.

Could all of this be reasonably explained by the controller failure and/or the subsequent abrupt power loss, or should I be looking for another cause? Personally, as I look at this, I'm suspicious that either the controller didn't persist dirty pages in the June 14th failure or there is some ongoing hardware problem.

Thoughts or suggestions?

-Kevin
