Re: Plug-pull testing worked, diskchecker.pl failed

From:	Chris Angelico <rosuav(at)gmail(dot)com>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Plug-pull testing worked, diskchecker.pl failed
Date:	2012-10-24 14:04:50
Message-ID:	CAPTjJmpXC+FM5U=kDxv+k-iK9Az=po9agkh34LvihSbrLpz+ug@mail.gmail.com
Lists:	pgsql-general

On Tue, Oct 23, 2012 at 9:51 AM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> On Mon, Oct 22, 2012 at 7:17 AM, Chris Angelico <rosuav(at)gmail(dot)com> wrote:
>> After reading the comments last week about SSDs, I did some testing of
>> the ones we have at work - each of my test-boxes (three with SSDs, one
>> with HDD) subjected to multiple stand-alone plug-pull tests, using
>> pgbench to provide load. So far, there've been no instances of
>> PostgreSQL data corruption, but diskchecker.pl reported huge numbers
>> of errors.
>
> Try starting pgbench, and then, halfway through the checkpoint
> timeout, issue a checkpoint, and WHILE the checkpoint is
> still running, THEN pull the plug.
>
> Then after bringing the server up (assuming pg starts up) see if
> pg_dump generates any errors.

Thanks for the tip. I've been flat-out at work these past few days and
haven't gotten around to testing in the middle of a checkpoint, but I
have done something that might also be of interest. It's inspired by a
combination of diskchecker and pgbench: a harness that puts the
database under load and retains a record of what's been done.

In brief: Create a table with N (e.g. 100) rows, then spin as fast as
possible, incrementing a counter against one random row and also
incrementing the "Total" counter. When the database goes down, wait
for it to come up again; when it does, check against the local copy of
the counters and report any discrepancies.

The code's written in Pike, using the same database connection logic
that we use in our actual application (well, some of our code is C++
and some is PHP, so this corresponds to one part of our app), so this
is roughly representative of real usage.

It's about a page or two of code: http://pastebin.com/UNTj642Y

Currently, all the key parameters (database connection info, which has
been censored for the pastebin version; pool size; thread count; etc.)
are just variables visible in the script, which is simpler than parsing
command-line arguments.

Is this a useful and plausible testing methodology? It has definitely
shown up some failures. On a hard disk, all is well as long as the
write-back cache is disabled; on the SSDs, I can't make them reliable.

Is a single table enough to test for corruption with?

Chris Angelico
