Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342

From: Tambet Matiisen <tambet(dot)matiisen(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342
Date: 2011-03-16 19:08:07
Message-ID: 4D810A97.3000602@gmail.com
Lists: pgsql-bugs

On 16.03.2011 17:09, Kevin Grittner wrote:
> Tambet Matiisen<tambet(dot)matiisen(at)gmail(dot)com> wrote:
>
>> Pre-live database is restored from live database dump every night.
>
> How is that done? A single pg_dump of the entire live database
> restored using psql? Are both database servers at the same
> PostgreSQL version?

Yes, I run pg_dump on the live server and the result is copied to the
development server with rdiff-backup. The whole SQL dump is 12 GB
uncompressed and the rdiff delta is about 10-20 MB per day. Then I drop
the pre-live database on the development server and recreate it using
createdb and psql.
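
The drop-and-recreate step described above could be sketched roughly as
follows. This is only an illustration, not the actual script from this
report: the database name "prelive" and the dump path are hypothetical,
and ON_ERROR_STOP is added here as an assumption so that a failed
restore stops at the first error instead of continuing.

```python
import subprocess


def refresh_commands(dbname, dump_file):
    """Return the commands that drop, recreate and reload one database."""
    return [
        ["dropdb", dbname],
        ["createdb", dbname],
        # Stop at the first SQL error rather than loading a partial dump.
        ["psql", "-v", "ON_ERROR_STOP=1", "-f", dump_file, dbname],
    ]


def refresh(dbname, dump_file):
    """Run each step in order; a non-zero exit status raises an exception."""
    for argv in refresh_commands(dbname, dump_file):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical names, printed only; call refresh() to actually run them.
    for argv in refresh_commands("prelive", "/backup/live.sql"):
        print(" ".join(argv))
```

Building the command list separately from running it makes the sequence
easy to inspect or log before anything is dropped.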

For a while the development server was running 8.4 and the live server
8.1. Now both are 8.4, but this shouldn't matter, as I back up and
restore via SQL.

>
>> So far the errors have been in pre-live database,
>
> You're running pg_dump against a database you just restored from a
> pg_dump image?

Hmm, yeah. This sounds rather dumb, but I haven't got to that yet.
The development server also contains some additional databases that do
not exist on the live server.

>
>> Usually the next day error was gone. I mostly blamed badly timed
>> backup and restore scripts, although this shouldn't result in
>> errors.
>
> No it shouldn't -- if you're following any of the documented backup
> and restore techniques. I have a suspicion that you're just doing a
> file copy without stopping the live database or properly following
> the documented PITR backup and recovery techniques.

No, I don't do any advanced backup tricks. Just plain pg_dump and psql.

>
>> This time the error is not in pre-live database and therefore it
>> doesn't go away.
>
> If I understand you, this sounds like corruption in the live
> database; nothing on the pre-live database is part of causing this
> problem.

That would be the case if I did a filesystem-level copy, but I do not.

>
>> The server is also running [...] Samba [...]
>
> I hope you're not trusting Samba too far. For a while we were using
> it in backups across our WAN, and it mangled at least one file
> almost every day. We had to take to running md5sum against both
> ends for each file to ensure we didn't get garbage (until we
> converted everything to use TCP communications, which have never
> mangled anything for us).

As I said, I'm using rdiff-backup to transfer pure SQL files.
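
The check quoted above, running md5sum at both ends of a transfer, could
be sketched in Python along these lines. The function names are
hypothetical, and the sketch assumes both copies are readable from the
machine doing the comparison; across two hosts one would compute the
digest on each side and compare the hex strings.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, copied_path):
    """True when both files have the same MD5 digest."""
    return file_md5(source_path) == file_md5(copied_path)
```

Reading in chunks keeps memory use flat even for a dump of many
gigabytes.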

>
>> Both fsync and full_page_writes are on.
>
> Good. Without those an OS or hardware crash can corrupt your
> database.

Actually they are commented out, but I suppose this means "on".
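
To illustrate why a commented-out line means the default applies, here
is a simplified sketch of how postgresql.conf entries resolve. It is an
assumption-laden model, not the server's parser: it ignores include
directives, command-line overrides and '#' characters inside quoted
values. The DEFAULTS table covers only the two settings discussed, both
of which default to on.

```python
import re

# Built-in defaults for the two settings discussed in this thread.
DEFAULTS = {"fsync": "on", "full_page_writes": "on"}

# Matches "name = value" or "name value" once comments are removed.
_LINE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=?\s*(.*?)\s*$")


def effective_setting(conf_text, name, defaults=DEFAULTS):
    """Return the value a setting resolves to; the last active line wins."""
    value = defaults.get(name.lower())
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0]  # everything after '#' is a comment
        if not line.strip():
            continue
        match = _LINE.match(line)
        if match and match.group(1).lower() == name.lower():
            value = match.group(2).strip("'")
    return value
```

On a running server, SHOW fsync; and SHOW full_page_writes; report the
values actually in effect, which is the more reliable check.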

>
>> OK, I don't have UPS for this machine, but power has been stable.
>> Current uptime is 32 days, which I bet is from the last kernel
>> update.
>
> OK. A power outage wouldn't be too likely to matter if you have
> fsync and full_page_writes on.

That's a relief :).

>
>> Currently I blame either faulty memory or faulty software RAID
>> driver. I can easily eliminate the memory cause by running
>> memtest86 for few hours
>
> Is this ECC memory? If not, even a good test doesn't prove that a
> RAM problem didn't cause the corruption.

It's not ECC memory.

>
>> Now, off to buy UPS...
>
> Not a bad idea, but it doesn't sound like lack of that is likely to
> have caused the corruption in your live database, based on the
> settings you mentioned. (Assuming those settings are in use on the
> live server.)

I checked the live server; it also has fsync=on and full_page_writes=on.
But it shouldn't matter, because the backup of the live server doesn't
give any errors.

It is possible that the restore of the pre-live database using psql
takes so long that the backup of the same database using pg_dump has
already started. But again, this shouldn't matter, and it doesn't
explain why the latest error is in another database that hasn't changed
for months.

Now I have to find time to run memtest.

Tambet
