Re: Fwd: index corruption in PG 8.3.13

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: index corruption in PG 8.3.13
Date: 2011-03-10 13:45:22
Message-ID: AANLkTinUaGCL491fu==7h5FYJ-=hV7LqORP8QhHbZLgK@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 9, 2011 at 7:14 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Wed, Mar 9, 2011 at 11:28 PM, Nikhil Sontakke
> <nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
>> "Other peculiarity in the index file is that we found a lot of zeroed
>> out pages. Blocks from #279 to #518 are all completely zeroed out
>> without any signs of even a page header. Any ideas on how we can get
>> so many zeroed out blocks? Apart from the extend code path, I fail to
>> see any other. And this is an unusually large number of zero pages"
>>
>
> What does stat say for the index data file? Are the Size and Blocks
> values the same (modulo block size)? Or are these blocks actually not
> allocated?
>
> Postgres always forces blocks to be allocated but if they were lost
> due to filesystem corruption maybe they're not allocated any more.

Hmm, that gives me an idea. What if the index relation were extended
repeatedly, but the blocks are still all zero on disk because they've
never been written out of shared_buffers? Then, before the next
checkpoint, the system crashes. All those blocks are now zero on
disk. In theory, WAL replay should fix most of it, but there could be
portions of the WAL that haven't been flushed; or there could be some
other problem in WAL replay. For example, it strikes me that this
situation would be consistent with:

1. Somebody inserts a bunch of new tuples into the relation, causing
growth in the index.
2. Before the blocks are written to disk, system crash.
3. pg_resetxlog

But it wouldn't have to be exactly that. Like what if during replay,
you hit a WAL record that had a bit flipped so it failed the checksum.
I think the system would just treat that as end-of-WAL and start up.
Now you could have some corruption, and a big chunk of zero blocks in
the index. Now you go along in normal running, maybe not realizing
that those zero blocks are there, and start extending the relation as
you need to update the index...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2011-03-10 14:00:47 Re: Update of replication/README
Previous Message Zotov 2011-03-10 13:12:29 Prefered Types