From: Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Farina <daniel(at)heroku(dot)com>, "Harold A(dot) Giménez" <harold(dot)gimenez(at)gmail(dot)com>
Subject: Re: Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)
Date: 2012-07-18 04:57:53
Message-ID: 50064251.5010908@ringerc.id.au
Lists: pgsql-hackers pgsql-performance

On 07/18/2012 08:31 AM, Tom Lane wrote:
> Not sure if we need a whole "farm", but certainly having at least one
> machine testing this sort of stuff on a regular basis would make me feel
> a lot better.

OK. That's something I can actually be useful for.

My current qemu/kvm test harness control code is in Python, since that's
what all the other tooling is in for the project I was using it on.
Would it be useful for me to adapt that code into a Pg crash-test
harness, or do you need a particular tool/language to be used? If so,
which? I'll do pretty much anything except Perl. I'll have a result for
you more quickly working in Python, though I'm happy enough to write it
in C (or Java, but I'm guessing that won't get any enthusiasm around
here).

> One fairly simple test scenario could go like this:
>
> * run the regression tests
> * pg_dump the regression database
> * run the regression tests again
> * hard-kill immediately upon completion
> * restart database, allow it to perform recovery
> * pg_dump the regression database
> * diff previous and new dumps; should be the same
>
> The main thing this wouldn't cover is discrepancies in user indexes,
> since pg_dump doesn't do anything that's likely to result in indexscans
> on user tables. It ought to be enough to detect the sort of system-wide
> problem we're talking about here, though.

It also won't detect issues that only occur at certain points in
execution, under concurrent load, etc. Still, it's a start, and I could
look at extending it into some kind of "crash fuzzing" once the basics
are working.
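
The quoted scenario could be sketched roughly as follows. This is only an
outline under stated assumptions: the four callables (run_regression,
dump_database, hard_kill, restart_and_recover) are hypothetical hooks that
a real harness would have to supply; nothing here is existing code.

```python
import difflib
from typing import Callable, List


def crash_recovery_diff(
    run_regression: Callable[[], None],
    dump_database: Callable[[], str],
    hard_kill: Callable[[], None],
    restart_and_recover: Callable[[], None],
) -> List[str]:
    """Run the scenario once; return diff lines (empty means the dumps match).

    All four arguments are hypothetical hooks supplied by the harness.
    """
    run_regression()               # run the regression tests
    before = dump_database()       # pg_dump the regression database
    run_regression()               # run the regression tests again
    hard_kill()                    # hard-kill immediately upon completion
    restart_and_recover()          # restart, let crash recovery finish
    after = dump_database()        # pg_dump the regression database again
    # Compare the two dumps line by line; any output is a failure.
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before-crash",
            tofile="after-recovery",
            lineterm="",
        )
    )
```

An empty list means the two dumps were identical; any returned lines
show where the recovered database differs.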

> In general I think the hard part is automated reproduction of an
> OS-crash scenario, but your ideas about how to do that sound promising.

It's worked well for other testing I've done. Any writes that are still
in the guest OS's memory, write queues, etc. are lost when kvm is killed,
just like a hard crash. Anything the kvm guest has flushed to "disk" is
on the host and preserved: either on the host's disks
(cache=writethrough) or at least in dirty writeback buffers in the
host's RAM (cache=writeback).
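
As a minimal sketch of the "kill kvm" step, assuming a POSIX host and a
harness that started the guest with Python's subprocess module (the
function name hard_kill is made up for illustration):

```python
import signal
import subprocess


def hard_kill(proc: subprocess.Popen) -> int:
    """Send SIGKILL so the process gets no chance to flush or clean up.

    Returns the exit status reported by wait(); on POSIX this is the
    negated signal number when the process died from a signal.
    """
    proc.send_signal(signal.SIGKILL)
    return proc.wait()
```

SIGKILL cannot be caught, so whatever the guest still held only in its
own memory is gone, which is the point of the exercise.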

kvm can even do a decent job of simulating a BBU-equipped write-through
volume by allowing the host OS to do write-back caching of KVM's backing
device/files. You don't get to set a max write-back cache size directly,
but Linux I/O writeback settings provide some control.
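
For reference, the host-side writeback settings in question are most
likely the vm.dirty_* sysctls. A small sketch for recording them alongside
a test run, assuming a Linux host that exposes them under /proc/sys/vm
(the helper name and the choice of knobs are illustrative, not part of
any existing harness):

```python
from pathlib import Path
from typing import Dict

# Linux sysctls that influence how much dirty data the host may hold
# and how long it may hold it before writing it back.
WRITEBACK_KNOBS = (
    "dirty_background_bytes",
    "dirty_background_ratio",
    "dirty_bytes",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_writeback_settings(base: str = "/proc/sys/vm") -> Dict[str, int]:
    """Return the current value of each knob that exists under `base`."""
    settings = {}
    for name in WRITEBACK_KNOBS:
        path = Path(base) / name
        if path.is_file():
            settings[name] = int(path.read_text().strip())
    return settings
```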

My favourite thing about kvm is that it's just another command. It can
be run headless and controlled via virtual serial console and/or its
monitor socket. It doesn't require special privileges and can operate on
ordinary files. It's very well suited for hooking into test harnesses.
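
To illustrate, a headless invocation might be assembled like this. It is
a sketch only: the binary name qemu-system-x86_64, the memory size, and
the socket paths are assumptions, and option spellings differ between
qemu versions (newer releases prefer server=on,wait=off over
server,nowait).

```python
from typing import List


def build_qemu_command(
    disk_image: str,
    monitor_socket: str,
    serial_socket: str,
    cache: str = "writethrough",
    memory_mb: int = 1024,
    binary: str = "qemu-system-x86_64",  # assumed binary name
) -> List[str]:
    """Build an argv list for a headless kvm guest on an ordinary file."""
    # Only the two cache modes discussed above are accepted here;
    # qemu itself supports more.
    if cache not in ("writethrough", "writeback"):
        raise ValueError(f"unsupported cache mode: {cache}")
    return [
        binary,
        "-enable-kvm",
        "-m", str(memory_mb),
        "-drive", f"file={disk_image},cache={cache}",
        "-display", "none",  # no graphical console
        "-serial", f"unix:{serial_socket},server,nowait",
        "-monitor", f"unix:{monitor_socket},server,nowait",
    ]
```

The resulting list can be passed straight to subprocess.Popen, and the
harness then talks to the guest over the two Unix sockets.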

The only challenge with using kvm/qemu is that there have been some
breaking changes and a couple of annoying bugs that mean I won't be able
to support anything except pretty much the latest versions initially.
kvm is easy to compile and has limited dependencies, so I don't expect
that to be an issue, but thought it was worth raising.

--
Craig Ringer
