Re: Perl script failure => Postgres 7.1.2 database corruption

From: Frank McKenney <frank_mckenney(at)mindspring(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ethan Burnside <support(at)kattare(dot)com>, "PostgreSQL Bug List" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Perl script failure => Postgres 7.1.2 database corruption
Date: 2001-11-09 13:47:42
Message-ID: 200111091847.NAA05962@barry.mail.mindspring.net
Lists: pgsql-bugs

** Reply to message from Tom Lane
<tgl(at)sss(dot)pgh(dot)pa(dot)us>
on Fri, 09 Nov 2001 12:36:38 -0500

Tom,

Thanks for taking the time to reply. I have to say that I'm very
impressed with the "expressive power" and performance of
PostgreSQL's implementation of SQL; I used it in this situation
because "it was there" when I took over the project, but what
I'm doing would be difficult or impossible with other SQL
products I've seen.

(Or maybe it just lets me get away with really arcane syntax I
shouldn't be using (;-))

> Frank McKenney <frank_mckenney(at)mindspring(dot)com> writes:
> > 1) Are there circumstances under which a "space exceeded" error on
> > a client machine _can_ damage a database on the Postgres server
> > machine?
>
> I don't see how. The error messages you cite all point to the idea
> that there was some internal corruption in the database. I'd venture
> more than one corrupted block, in fact. It appears that block 1444
> of your summary table's toast relation was clobbered (probably zeroed
> out in whole or in part), and the "relation nnnnn does not exist"
> complaints look like some bad things had happened to one or more system
> tables as well.

Hm. So it's likely that the only reason that site users didn't see
the corruption while all this was going on was that they didn't
happen to invoke that particular portion of that particular table?
That would make sense -- the site is lightly loaded right now; its
status is "beta moving into release", and only 8 users have access
for database queries.

If it makes any difference (and I wish I had remembered this
earlier), the 'summary' table contains large formatted HTML-text
fields, on the order of 8-14K. We needed Postgres 7.1 to do this.

--

So it's likely (based on very limited evidence) that whatever
corrupted the database was something limited entirely to the
server's environment (hardware/OS/software)? Or, to put it another
way, since I'm wearing my "application programmer" hat in this
situation (as opposed to my "systems programmer" hat), there's not much I can do to
prevent this from recurring?

Other than sacrificing the occasional goat, of course (;-).

> Unfortunately, since you deleted the database, all the evidence is
> gone and there's no longer much hope of learning any more. If
> something like this happens again, it might be worth tar'ing up the
> $PGDATA tree (while the postmaster is stopped) for possible forensic
> analysis.

It was a "tight" situation, and I confess to leaning rather heavily
on the web host's support person to delete it. My main interest,
at 23:30 or so, was to get _something_ working cleanly and properly
before morning (I finished, what with post-rebuild testing and all,
around 03:30).
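
If there is a next time, I'll try to grab that snapshot before
anything gets deleted. I assume it amounts to something roughly like
this (untested, and the snapshot filename and data directory are just
placeholders for whatever is actually on that box):

    pg_ctl stop                              # stop the postmaster cleanly
    tar czf pgdata-snapshot.tar.gz $PGDATA   # archive the whole data tree
    pg_ctl start                             # bring the server back up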

Looking back, I suppose we could have renamed all the tables to
bad_summary, bad_xxxx, etc. However, since I'm fairly new to SQL
(let alone Postgres), I think I'd have been concerned that the
corruption might "leak out" at some point and corrupt other tables.
I've seen this happen in other (non-Postgres) situations.

Is there a nicely documented procedure for renaming/moving an entire
database to a different account that I overlooked in my haste? I've
skimmed the 7.1 Reference manual to see if it listed anything that
looked right, but if it's in there I missed it.
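
For the archives, the best I've come up with on my own would be to do
it by hand, something like this (an untested sketch -- "baddb" and
"quarantine" are made-up names, and whether a partly corrupted
database would survive the dump is another question):

    # rename the suspect table in place:
    psql -c "ALTER TABLE summary RENAME TO bad_summary" baddb

    # or copy everything into a separate database for later inspection:
    createdb quarantine
    pg_dump baddb | psql quarantine

That still doesn't move anything to a different account, though, so
I'd be glad to hear if there's a blessed procedure for that part.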

> > 3) What other things could we have tried to recover this situation?
>
> I think you were pretty much out of luck on that database, though
> perhaps partial data recovery could have been made if you were willing
> to spend time on it. A more interesting thing to worry about is how to
> ensure it doesn't happen again, and here my advice would be to look at
> the reliability of your disk drives and I/O hardware. I've seen more
> than one report of mysterious data clobbers that eventually traced to
> bogus disk controllers, flaky RAM, etc. In particular, I recall
> several data-block-suddenly-became-zero failures with hardware origins,
> and none that traced to software problems...

Thank you for cc:ing my web hosting service's support group. I'll
follow up with them on this.

Is there a simple, light-load way of testing the integrity of a
given table/database? Experience seems to indicate that

pg_dump -what -ever database > /dev/null

would do a fairly good job on the entire database, but it would be
nice if there were a 20-second 'verifydb' program. Is there anything
available along those lines, designed specifically for that purpose?
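
In the meantime, the closest thing I can think of myself is a
table-at-a-time read pass, along these lines (untested; "mydb" is a
placeholder, and the pg_tables filter is just my guess at a way to
skip the system catalogs):

    # force a full read (TOAST data included) of each user table,
    # discard the output, and note which tables pg_dump chokes on:
    for t in `psql -t -A -c "SELECT tablename FROM pg_tables WHERE tablename NOT LIKE 'pg_%'" mydb`
    do
        echo "checking table: $t"
        pg_dump -t $t mydb > /dev/null || echo "*** pg_dump failed on $t"
    done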

Is there a FAQ anywhere that discusses Postgres database corruption
and repair?

Anyone know who's writing Chapter 10 for the next release of the
"PostgreSQL Administrator's Guide"? It pops up all _over_ the place
when you use Google to try to find out how to fix a Postgres
database (;-).

In any case, thanks for the feedback. We learn, we try to plan for
"next time", and Murphy ensures that we'll have many more such
"educational opportunities" (;-).

Frank McKenney, McKenney Associates
Richmond, Virginia / (804) 320-4887
E-mail: frank_mckenney(at)mindspring(dot)com
