Re: Data corruption after SAN snapshot

From: Terry Schmitt <tschmitt(at)schmittworks(dot)com>
To: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Data corruption after SAN snapshot
Date: 2012-08-08 01:34:17
Message-ID: CAOOcysxktC6qkc0j4cruhtA5Ec5BLDb9Q0hX9YWkb8G7F1_JdA@mail.gmail.com
Lists: pgsql-admin

Thanks Craig.

"# Brad's el-ghetto do-our-storage-stacks-lie?-script" I like it already :)

I may play around with that. Looks interesting. For everyone else, here's a
post describing the use of diskchecker:
http://brad.livejournal.com/2116715.html
I experimented with sysbench today, which was somewhat enlightening. It
clearly shows the impact that fsync/fdatasync has on file system
performance. Judging simply by the throughput of each test, it's pretty
obvious that fsync is writing out to disk.
Using pgbench is a good idea, as I can throw a high transaction rate at the
database and take a snapshot during the test. So far, running pg_dumpall
seems to be fairly reliable for finding the corrupt objects after my
initial data load, but unfortunately much of the corruption has been in
indexes, which pg_dump will not expose.
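
For anyone who wants to see the fsync/fdatasync effect without installing
sysbench, here is a minimal sketch in Python. It is not what sysbench does
internally; the function names, block size and write count are my own
choices for illustration. It only compares throughput with and without a
sync call after each write. It cannot prove that the storage stack honours
the flush (that is what diskchecker is for), and on a RAM-backed file system
the three numbers may come out nearly identical.

```python
import os
import tempfile
import time

# os.fdatasync is missing on some platforms; fall back to os.fsync there.
FDATASYNC = getattr(os, "fdatasync", os.fsync)


def timed_writes(path, count=200, block_size=8192, sync=None):
    """Write `count` blocks to `path`, calling `sync(fd)` after each write.

    `sync` is None (leave the data in the OS cache), os.fsync or
    os.fdatasync. Returns the elapsed time in seconds.
    """
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if sync is not None:
                sync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare(directory=None, count=200):
    """Return writes per second for no sync, fdatasync and fsync.

    `directory` should live on the file system under test; None means
    the system default temporary directory.
    """
    modes = (("none", None), ("fdatasync", FDATASYNC), ("fsync", os.fsync))
    results = {}
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "synctest.dat")
        for name, sync in modes:
            elapsed = timed_writes(path, count=count, sync=sync)
            results[name] = count / elapsed
    return results


if __name__ == "__main__":
    for mode, rate in compare().items():
        print(f"{mode:>10}: {rate:10.0f} writes/sec")
```

Pass the mount point you care about as `directory`, otherwise you are only
testing wherever the default temporary directory happens to live. If the
synced rates are about as high as the unsynced one on real disks, that is a
hint that something in the stack is caching the flushes.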

Thanks for the input,
T

On Tue, Aug 7, 2012 at 6:11 PM, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au> wrote:

> On 08/08/2012 06:23 AM, Terry Schmitt wrote:
>
>> Anyone have a solid method to test if fdatasync is working correctly or
>> thoughts on troubleshooting this?
>
> Try diskchecker.pl
>
> https://gist.github.com/3177656
>
> The other obvious step is that you've changed three things, so start
> isolation testing.
>
> - Test Postgres Plus Advanced Server 8.4, which you knew worked, on your
> new file system and OS.
>
> - Test PP9.1 on your new OS but with ext3, which you knew worked
>
> - Test PP9.1 on your new OS but with ext4, which should work if ext3 did
>
> - Test PP9.1 on a copy of your *old* OS with the old file system setup.
>
> - Test mainline PostgreSQL 9.1 on your new setup to see if it's PP
> specific.
>
> Since each test sounds moderately time consuming, you'll probably need to
> find a way to automate. I'd first see if I could reproduce the problem when
> running PgBench against the same setup that's currently failing, and if
> that reproduces the fault you can use PgBench with the other tests.
>
> --
> Craig Ringer
>
>
