Re: Tracking down log segment corruption

From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 17:55:26
Message-ID: s2qb2dd93301005021055w93b80fb6q9ec27b826a550f7d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sun, May 2, 2010 at 11:02 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>
>
> Hmm ... AFAICS the only way to get that message when the incoming TID's
> offsetNumber is only 2 is for the index page to be completely empty
> (not zeroes, else PageAddItem's sanity check would have triggered,
> but valid and empty). What that smells like is a software bug, like
> failing to emit a WAL record in a case where it was necessary. Can you
> identify which index this was? (Look for relfilenode 48778276 in the
> database with OID 22362.) If so, can you give us any hints about
> unusual things that might have been done with that index?
>

Interesting. There is no pg_class entry for 22362. There is, however, an
entry for that filenode. It's an index I created Sat AM, about 6AM.

select oid, * from pg_class where relfilenode=48778276;
-[ RECORD 1 ]--+---------------------------
oid | 48777488
relname | cts_20100501_topic_date_nk
relnamespace | 2200
reltype | 0
relowner | 16412
relam | 403
relfilenode | 48778276
reltablespace | 48777166
relpages | 2476
reltuples | 58879
reltoastrelid | 0
reltoastidxid | 0
relhasindex | f
relisshared | f
relistemp | f
relkind | i
relnatts | 2
relchecks | 0
relhasoids | f
relhaspkey | f
relhasrules | f
relhastriggers | f
relhassubclass | f
relfrozenxid | 0
relacl | null
reloptions | null

Possibly relevant facts:

- The WSB server went active on Friday around 3:30PM
- On Friday evening, I added about 11 tablespaces. I noted the new files on
the WSB, no problems.
- On Sat morning, I created a partitioned table cts_20100501 (inherits from
another table) and 4 indexes.
- This morning, I was doing some table maintenance on the master and
discovered I had created this table and its indexes in the wrong tablespace.
I wanted the table in ts29, but had it in ts30. Vice versa for the
indexes. So I moved them. This is from my command history:

alter index cts_20100501_natural_uk set tablespace ts30;
alter index cts_20100501_pkey set tablespace ts30;
alter index cts_20100501_topic_date_nk set tablespace ts30;
alter index cts_20100501_updated_nk set tablespace ts30;
alter table cts_20100501 set tablespace ts29;

These commands worked fine on the master, yet this seems suspiciously
relevant.

> > Any suggestions?
>
> As far as recovering goes, there's probably not much you can do except
> resync the standby from scratch. But it would be nice to get to the
> bottom of the problem, so that we can fix the bug. Have you got an
> archive of this xlog segment and the ones before it, and would you be
> willing to let a developer look at them?
>
>
Before I received your reply, I had already started the re-sync, and
unfortunately already deleted the wal logs in question.
If it happens again, I will certainly keep them, and would be happy to share
them.

Regards,
Gordon

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andy 2010-05-02 17:57:33 Re: PostgreSQL vs. Microsoft SQL server
Previous Message Tom Lane 2010-05-02 17:02:03 Re: Tracking down log segment corruption