Missing pg_clog file / corrupt index / invalid page header

From: alex <an(at)clickware(dot)de>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Missing pg_clog file / corrupt index / invalid page header
Date: 2007-09-05 06:18:31
Message-ID: 46DE4A37.3020703@clickware.de
Lists: pgsql-bugs

My colleague Marc Schablewski first reported this bug (#3484) at the end
of July.
The described problem had occurred twice in our database, and now it has
happened again.

Summary
==========

Various errors like:
"invalid page header in block 8658 of relation",
"could not open segment 2 of relation 1663/77142409/266753945 (target block 809775152)",
"ERROR: could not access status of transaction 2134240 DETAIL: could not open file "pg_clog/0002": File not found",
"CEST PANIC: right sibling's left-link doesn't match"

on the following system:
Postgres 8.1.8
SuSE Linux, kernel 2.6.13-15.8-smp
2 Intel Xeon processors with 2 cores each
ECC RAM
Hardware RAID (mirror set)

Detailed description
=======================

The message was thrown by the nightly pg_dump:
pg_dump: ERROR: invalid page header in block 8658 of relation
"import_data_zeilen"
pg_dump: SQL command to dump the contents of table "import_data_zeilen"
failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: invalid page header in block
8658 of relation "import_data_zeilen"
pg_dump: The command was: COPY public.import_data_zeilen (id, eda_id,
zeile, man_id, sta_id) TO stdout;

A dedicated dump of the affected table, run manually during the day,
completed successfully, which really surprised us.
SELECT queries on the table (using indexes) also succeeded, whereas in
the past, when this error occurred, SELECT queries failed.
So the table did not seem to need any repair.

The following night pg_dump succeeded, but the "vacuum analyze" run
after it threw the same error:

INFO: vacuuming "public.import_data_zeilen"
ERROR: invalid page header in block 8658 of relation "import_data_zeilen"

Now every index-based SELECT on this table failed, at least whenever
the result set included the corrupted data.

This behaviour is very confusing.

Re-creating the table solved the problem. However, the damaged rows were
lost.

We have two systems, one active, one for tests.
They are nearly identical: they have similar hardware, run the same
software and carry the same load.
The errors always occurred on the active server; the test server did not
run into any errors after both servers were upgraded from 8.1.3 to 8.1.8.

So, even though no hardware errors had been detected (neither ECC RAM
errors nor disk errors), we decided to swap the servers' roles to find
out whether this is a hardware or a software problem.
That was 12 days ago.

Now we have hit another error, again on the active system (which now
runs on the other machine's hardware, except for one of the hard disks
in the RAID). It was thrown by an INSERT statement issued by our
application:
org.postgresql.util.PSQLException: ERROR: could not open segment 2 of
relation 1663/77142409/266753945 (target block 809775152): Datei oder
Verzeichnis nicht gefunden [No such file or directory].

Obviously we have a problem with the active server.
But it is unlikely to be a hardware problem, because we swapped the hard
disks and the error still occurred on the same (software) system.
In addition, we are using ECC RAM and a RAID mirror set with a hardware
RAID controller, which hasn't reported any errors.

We have read the most recent thread about this bug, in which the problem
was attributed to a kernel bug in 2.6.11.
We are running a newer kernel: 2.6.13-15.8-smp.
Hardware: 2 dual-core processors (Intel(R) Xeon(TM) CPU 2.80GHz)
PostgreSQL version: 8.1.8

Four days ago we did a lot of database maintenance which, among other
changes, dropped about 10 indexes on one big table (35,000,000 rows) and
created about 10 other indexes for better performance.

Given that the problem has occurred on two different machines, we are
very sure that it is *not* a hardware problem.

We would really appreciate any help with our problems.

Thanks in advance

A. Nitzschke
