Index error on recovery

From: K P Manoj <kpmanojpg(at)gmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Index error on recovery
Date: 2011-12-16 08:56:07
Message-ID: CAMVgnQ4TCJP3A6ELQO1wVb4uhfD=XkuSuh893SoyAuByk8-=gQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi

The details given below is for a test server, that I brought back up from
the base (after running a day worth wals and by giving recovery_target_time
before starting the recovery). The recovery went fine and the DB came
online as expected. But I see that although trivial select queries seem to
work well (while using indexes of the table concerned) it throws up errors
which indicate that the index is corrupted when we try to run a query that
requires access to all rows of the table.

My question is that if this is index corruption (seems like it) where did
we go wrong with the recovery (the live server is still working fine
without any such error in the live-server-logs)? Do we need to reindex
after such a recovery (I thought that was required only when we do
something like pgresetxlog)? If this isn't a fault with recovery, does it
mean that the primary DB (which is currently active) need strict checking?
Any way we could do that, without bringing down the DB (i.e. anything
besides a PGDump)?

Thanks
Manoj K P

============================

PG Version:
==========
mydb=# SELECT version () ;
version

------------------------------------------------------------------------------------------------------------------
PostgreSQL 8.4.9 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)
4.1.2 20080704 (Red Hat 4.1.2-50), 64-bit

Successful WAL Recovery:
=======================

Dec 6 14:25:47 database_host_name postgres[19329]: [3150-1] 2011-12-06
14:25:47.207 GMT 19329 LOG: restored log file "000000010000011000000013"
from archive
Dec 6 14:25:48 database_host_name postgres[19329]: [3151-1] 2011-12-06
14:25:48.540 GMT 19329 LOG: restored log file "000000010000011000000014"
from archive
Dec 6 14:25:49 database_host_name postgres[19329]: [3152-1] 2011-12-06
14:25:49.833 GMT 19329 LOG: restored log file "000000010000011000000015"
from archive
Dec 6 14:25:51 database_host_name postgres[19329]: [3153-1] 2011-12-06
14:25:51.755 GMT 19329 LOG: restored log file "000000010000011000000016"
from archive
Dec 6 14:25:53 database_host_name postgres[19329]: [3154-1] 2011-12-06
14:25:53.155 GMT 19329 LOG: restored log file "000000010000011000000017"
from archive
Dec 6 14:25:54 database_host_name postgres[19329]: [3155-1] 2011-12-06
14:25:54.462 GMT 19329 LOG: restored log file "000000010000011000000018"
from archive
Dec 6 14:25:54 database_host_name postgres[19329]: [3156-1] 2011-12-06
14:25:54.570 GMT 19329 LOG: recovery stopping before commit of transaction
19723746, time 2011-11-24 12:39:00.005756+00
Dec 6 14:25:54 database_host_name postgres[19329]: [3157-1] 2011-12-06
14:25:54.570 GMT 19329 LOG: redo done at 110/18F069E8
Dec 6 14:25:54 database_host_name postgres[19329]: [3158-1] 2011-12-06
14:25:54.570 GMT 19329 LOG: last completed transaction was at log time
2011-11-24 12:38:59.966238+00
Dec 6 14:25:54 database_host_name postgres[19329]: [3159-1] 2011-12-06
14:25:54.597 GMT 19329 LOG: selected new timeline ID: 2
Dec 6 14:25:54 database_host_name postgres[19329]: [3160-1] 2011-12-06
14:25:54.623 GMT 19329 LOG: archive recovery complete
Dec 6 14:25:55 database_host_name postgres[19382]: [20-1] 2011-12-06
14:25:55.845 GMT 19382 LOG: restartpoint complete: wrote 396106 buffers
(37.8%); write=318.653 s, sync=0.000 s, total=318.654 s
Dec 6 14:25:55 database_host_name postgres[19382]: [21-1] 2011-12-06
14:25:55.845 GMT 19382 LOG: recovery restart point at 10E/FAE0DFA8
Dec 6 14:25:55 database_host_name postgres[19382]: [22-1] 2011-12-06
14:25:55.848 GMT 19382 LOG: checkpoint starting: end-of-recovery immediate
wait
Dec 6 14:26:01 database_host_name postgres[19382]: [23-1] 2011-12-06
14:26:01.484 GMT 19382 LOG: checkpoint complete: wrote 376660 buffers
(35.9%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=5.635 s, sync=0.000 s, total=5.636 s
Dec 6 14:26:04 database_host_name postgres[19328]: [1-1] 2011-12-06
14:26:04.340 GMT 19328 LOG: database system is ready to accept connections

Basic Querying uses Index and works fine:
========================================

mydb =# EXPLAIN SELECT * from tb1_1 order by "ObjId" desc limit 10 ;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
--------
Limit (cost=0.00..7.41 rows=10 width=28)
-> Index Scan Backward using "IND_tbl_1_ObjId" on tbl_1
(cost=0.00..1174428208.24 rows=1584137511 width=28)

mydb=# SELECT * from users_activity order by "ObjId" desc limit 10 ;
ObjId | date | Usr | Obj_typ |
Id | Act_id | new_id
------------+---------------------------------+--------+--------------+-----------+----------+----------------
1907972745 | Thu 24 Nov 07:38:59.013068 2011 | 272 | 2 |
479937676 | 1 |
1907972743 | Thu 24 Nov 07:38:59.829819 2011 | 272 | 2 |
350683600 | 1 |
1907972742 | Thu 24 Nov 07:38:59.679056 2011 | 0 | 2 |
479937674 | 11 |
1907972741 | Thu 24 Nov 07:38:59.550655 2011 | 272 | 2 |
423704886 | 1 |
1907972740 | Thu 24 Nov 07:38:59.355022 2011 | 239 | 2 |
441683722 | 1 |
1907972739 | Thu 24 Nov 07:38:59.40387 2011 | 272 | 2 |
1061714 | 1 |
1907972738 | Thu 24 Nov 07:38:59.262335 2011 | 272 | 2 |
350683588 | 1 |
1907972737 | Thu 24 Nov 07:38:59.146628 2011 | 272 | 2 |
340702121 | 1 |
1907972736 | Thu 24 Nov 07:38:59.102374 2011 | 239 | 2 |
467763707 | 1 |
1907972735 | Thu 24 Nov 07:38:59.06309 2011 | 272 | 2 |
423704873 | 1 |
(10 rows)

Accessing Each Row:
==================

mydb=# ALTER TABLE tbl_1 ADD new_id bigint ;
mydb=# EXPLAIN UPDATE tbl_1 SET new_id=id ;
QUERY PLAN

-------------------------------------------------------------------------------
Seq Scan on tbl_1 (cost=0.00..30853774.16 rows=1554095373 width=34)
(1 row)

mydb=# UPDATE tbl_1 SET new_id=id ;

ERROR: right sibling's left-link doesn't match: block 41923 links to
4268638 instead of expected 41922 in index "IND_tbl_1_ObjId"

Further, tried to work around the Index error:
=============================================
mydb=# DROP INDEX "IND_tb_1_ObjId" ;
mydb=# UPDATE tbl_1 SET new_id=id ;

ERROR: right sibling's left-link doesn't match: block 52886 links to
4266706 instead of expected 52885 in index "IND_tbl_1_Act_ObjId"

But now we have an error with the next index on the same table.
==============================================================

Other notes:
1. Primary server was running when the base backup was taken.
2. Base backup was taken with start / stop and recovery time was 1-2 days
after that, so enough logs were played on the standby server.

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Rural Hunter 2011-12-17 03:36:59 trigger statistics?
Previous Message Rural Hunter 2011-12-16 00:55:33 Re: Issues with log-shipping replication