Re: 9.0.4 Data corruption issue

From: Ken Caruso <ken(at)ipl31(dot)net>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: 9.0.4 Data corruption issue
Date: 2011-07-19 19:27:22
Message-ID: CAMg8r_o-ZT5-3JHaMbL4FRQcy0ZxwO9V+JxW2EyPB4CyMzy48g@mail.gmail.com
Lists: pgsql-admin

On Sun, Jul 17, 2011 at 3:04 AM, Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:

> 2011/7/17 Ken Caruso <ken(at)ipl31(dot)net>:
> >
> >
> > On Sat, Jul 16, 2011 at 2:30 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>
> >> Ken Caruso <ken(at)ipl31(dot)net> writes:
> >> > Sorry, the actual error reported by CLUSTER is:
> >>
> >> > gpup=> cluster verbose tablename;
> >> > INFO: clustering "dbname.tablename"
> >> > WARNING: could not write block 12125253 of base/2651908/652397108
> >> > DETAIL: Multiple failures --- write error might be permanent.
> >> > ERROR: could not open file "base/2651908/652397108.1" (target block
> >> > 12125253): No such file or directory
> >> > CONTEXT: writing block 12125253 of relation base/2651908/652397108
> >>
> >> Hmm ... it looks like you've got a dirty buffer in shared memory that
> >> corresponds to a block that no longer exists on disk; in fact, the whole
> >> table segment it belonged to is gone. Or maybe the block or file number
> >> in the shared buffer header is corrupted somehow.
> >>
> >> I imagine you're seeing errors like this during each checkpoint attempt?
> >
> > Hi Tom,
> > Thanks for the reply.
> > Yes, I tried a pg_start_backup() to force a checkpoint and it failed due
> > to a similar error.
> >
> >>
> >> I can't think of any very good way to clean that up. What I'd try here
> >> is a forced database shutdown (immediate-mode stop) and see if it starts
> >> up cleanly. It might be that whatever caused this has also corrupted
> >> the back WAL and so WAL replay will result in the same or similar error.
> >> In that case you'll be forced to do a pg_resetxlog to get the DB to come
> >> up again. If so, a dump and reload and some manual consistency checking
> >> would be indicated :-(
> >
> > Before seeing this message, I restarted Postgres and it was able to get
> > to a consistent state, at which point I reclustered the db without error
> > and everything appears to be fine. Any idea what caused this? Was it
> > something to do with the Vacuum Full?
>
> Block number 12125253 is bigger than any block we can find in
> base/2651908/652397108.1
> Should the table size be in the 100GB range or the 2-3 GB range?
> This should help decide: in the former case, probably at least
> one segment disappeared; in the latter, the shared buffers became
> corrupted.
>
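
For reference, the arithmetic behind that question can be sketched as follows. This is only an illustration, not anything taken from the server: it assumes the default build settings of 8 kB blocks (BLCKSZ = 8192) and 1 GB segment files (RELSEG_SIZE = 131072 blocks), and the function name locate_block is made up for the example. A build with a different RELSEG_SIZE would give different numbers.

```python
# Assumed defaults; both are compile-time options and may differ on a custom build.
BLCKSZ = 8192          # bytes per block
RELSEG_SIZE = 131072   # blocks per segment file (1 GB with 8 kB blocks)


def locate_block(block_number, blcksz=BLCKSZ, relseg_size=RELSEG_SIZE):
    """Return (segment number, block offset inside that segment,
    minimum relation size in bytes needed to contain the block)."""
    segment = block_number // relseg_size
    block_in_segment = block_number % relseg_size
    min_relation_bytes = (block_number + 1) * blcksz
    return segment, block_in_segment, min_relation_bytes


if __name__ == "__main__":
    segment, offset, min_bytes = locate_block(12125253)
    print("segment file suffix: .%d" % segment)
    print("block within segment: %d" % offset)
    print("relation must be at least %.1f GiB" % (min_bytes / 2**30))
```

Under those assumptions the block from the error message falls in segment 92 and implies a table of at least roughly 92.5 GiB, so the block number is only plausible if the table really is in the 100GB range.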

The DB was in the 200GB-300GB range when this happened. What would cause the
segment to go missing? Just wondering if there is any further action I
should take like filing a bug or if this is a known issue. Thanks for
everyone's help.

-Ken

>
> Ken, you didn't change RELSEG_SIZE, right? (It needs to be changed in
> the source code before you compile it yourself.)
> In both cases a hardware check is welcome, I believe.
> --
> Cédric Villemain 2ndQuadrant
> http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
>
