Re: Database corruption help

From: John Lister <john(dot)lister-ps(at)kickstone(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Database corruption help
Date: 2009-02-13 09:17:39
Message-ID: 49953AB3.50304@kickstone.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin pgsql-hackers

>Please send to pgsql-hackers --- I'd like to get more eyeballs on this.
>There's no personally identifiable information here except that you've
>got a table named temp_queue that you've repeatedly TRUNCATEd or
>CLUSTERed or some such (likely the former since the reltuples counts
>are all zero). It'd be useful to know exactly what you were doing
>with it, though.
>
> regards, tom lane

More info as requested....

TEMP_QUEUE is used continously throughout the day (for some reason which
i chose not to use a normal temp table) data is grouped and inserted
from other tables, before being processed, and you are right the table
is then truncated before repeating - this happens at least once a minute
throughout the day..

After restarted following the pg_resetxlog, i noticed that the
temp_queue table was missing (had been created a day ago). I couldn't
select from it and had problems trying to recreate it (bits seemed to
exist already). I couldn't see it in pg_class or the other catalogs, but
after selecting to ignore the presumably corrupt indexes it appeared so
i manually removed the references (I've since learned about the -P
option to the server which may have helped here). The corruption in
pg_class manifested itself as multiple rows (for this and another table)
with the same OIDs

So not sure at what point what if anything got corrupted and what my
subsequent actions did to compound this..

Thanks

John Lister wrote:
> Originally in psql-admin, but copied here at the request of Tom to..
>
> Story so far, transaction log archiving went wrong causing the
> transaction log disk to fill up. Foolishly i deleted the unarchived
> transaction logs (early monday morning) which required a pg_resetxlog
> to get the db up and running again. Since then we've had some minor db
> corruption which has been fixed (mainly duplicate primary keys) except
> for the pg_class table.
> If i do a vacuum full on pg_class i get something like this:
>
> INFO: vacuuming "pg_catalog.pg_class"INFO: "pg_class": found 37
> removable, 1845 nonremovable row versions in 18905 pages
> DETAIL: 27 dead row versions cannot be removed yet.
> Nonremovable row versions range from 160 to 229 bytes long.
> There were 933834 unused item pointers.
> Total free space (including removable row versions) is 150368692 bytes.
> 18839 pages are or will become empty, including 0 at the end of the
> table.
> 0 pages containing 0 free bytes are potential move destinations.
> CPU 0.01s/0.05u sec elapsed 0.17 sec.INFO: index "pg_class_oid_index"
> now contains 1813 row versions in 7 pages
> DETAIL: 56 index row versions were removed.
> 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.INFO: index
> "pg_class_relname_nsp_index" now contains 1818 row versions in 24 pages
> DETAIL: 63 index row versions were removed.
> 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.WARNING: index
> "pg_class_relname_nsp_index" contains 1818 row versions, but table
> contains 1813 row versions
> HINT: Rebuild the index with REINDEX.INFO: analyzing
> "pg_catalog.pg_class"INFO: "pg_class": scanned 3000 of 18905 pages,
> containing 227 live rows and 6 dead rows; 227 rows in sample, 1430
> estimated total rows
> Total query runtime: 4469 ms.
>
> As you can see there are non-removable dead rows (which slowly grows)
> and the table size is also increasing in size.. A reindex on the
> indexes mentions also succeeds but another vacuum reports the same
> thing...
>
> In the log files the relevant bits are:
>
> GMT LOG: relation "pg_class" TID 15538/4: dead HOT-updated tuple ---
> cannot shrink relation
> 2009-02-12 21:06:40 GMT STATEMENT: VACUUM FULL VERBOSE ANALYZE pg_class
> 2009-02-12 21:06:40 GMT WARNING: index "pg_class_relname_nsp_index"
> contains 1818 row versions, but table contains 1813 row versions
>
>
> and finally the last message in the psql-admin thread suggested
> dumping the above page, which is attached before.
>
>> Hmm. The comments in vacuum.c about this case suppose that it could
>> only be a transient condition, ie the tuple became dead while we were
>> looking at the page. Evidently it's persistent for you, which means
>> that for some reason heap_page_prune() is failing to remove an
>> already-dead tuple. I suspect this implies a corrupt HOT chain, but
>> maybe the data is okay and what you've got is really a bug in
>> heap_page_prune.
>
>> Could you send a dump of page 15538 of pg_class, using pg_filedump?
>> The source code for it is here:
>> http://sources.redhat.com/rhdb/utilities.html
>> Best would be -i -f format, ie
>> pg_filedump -i -f -R 15538 $PGDATA/base/something/1259
>>
>> regards, tom lane
>
>
>
> Any help would be appreciated as the pg_class table is constantly
> growing which i'm guessing is going to start to affect performance
> fairly soon. I'd like to avoid a full restore from backup if possible.
>
> Thanks
>
> JOHN
> ------------------------------------------------------------------------
>
>
>

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Tom Lane 2009-02-13 16:19:00 Re: Database corruption help
Previous Message John Lister 2009-02-13 08:20:03 Database corruption help

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2009-02-13 09:27:01 Re: per-database locale: createdb switches
Previous Message Peter Eisentraut 2009-02-13 09:08:48 Re: Which installation parts are backward compatible?