Re: Load Distributed Checkpoints test results

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Load Distributed Checkpoints test results
Date: 2007-06-20 20:07:02
Message-ID: Pine.GSO.4.64.0706201512070.2198@westnet.com
Lists: pgsql-hackers

On Wed, 20 Jun 2007, Heikki Linnakangas wrote:

> Another series with 150 warehouses is more interesting. At that # of
> warehouses, the data disks are 100% busy according to iostat. The 90%
> percentile response times are somewhat higher with LDC, though the
> variability in both the baseline and LDC test runs seem to be pretty high.

Great, this is exactly the behavior I had observed and wanted someone
else to independently run into. When you're in 100% disk busy land, LDC
can shift the distribution of bad transactions around in a way that some
people may not be happy with, and that might represent a step backward
from the current code for them. I hope you can understand now why I've
been so vocal that it must be possible to pull this new behavior out so
the current form of checkpointing is still available.

While it shows up in the 90% figure, what happens is most obvious in the
response time distribution graphs. Someone who is currently getting a run
like #295: http://community.enterprisedb.com/ldc/295/rt.html

might be really unhappy if they turn on LDC expecting to smooth out
checkpoints and get the shift of #296 instead:
http://community.enterprisedb.com/ldc/296/rt.html

That is of course cherry-picking the most extreme examples. But it
illustrates my concern about the possibility of LDC making things worse
on a really overloaded system, which is kind of counter-intuitive because
you might expect that would be the best case for its improvements.

When I summarize the percentile behavior from your results with 150
warehouses in a table like this:

Test   LDC %   90%
295    None    3.703
297    None    4.432
292    10      3.432
298    20      5.925
296    30      5.992
294    40      4.132

I think it does a better job of showing how LDC can shift the top
percentile around under heavy load, even though there are runs where it's
a clear improvement. Since there is so much variability in results when
you get into this territory, you really need to run a lot of these tests
to get a feel for the spread of behavior. I spent about a week of
continuously running tests stalking this bugger before I felt I'd mapped
out the boundaries with my app. You've got your own priorities, but I'd
suggest you try to find enough time for a more exhaustive look at this
area before nailing down the final form for the patch.
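
To make the spread concrete, here is a minimal sketch (Python, purely illustrative and not part of the test harness) that groups the 90% figures from the 150-warehouse table into baseline and LDC runs and reports the min, max, and mean of each group. The `spread` helper and the tuple layout are hypothetical names of my own; with only two baseline and four LDC runs this is a way of eyeballing the range, not a statistical test.

```python
from statistics import mean

# (test run, LDC % setting or None for baseline, 90% response time),
# copied from the 150-warehouse table
runs = [
    (295, None, 3.703),
    (297, None, 4.432),
    (292, 10, 3.432),
    (298, 20, 5.925),
    (296, 30, 5.992),
    (294, 40, 4.132),
]


def spread(values):
    """Return (min, max, mean) for a list of 90% response times."""
    return min(values), max(values), mean(values)


# Split the runs into the two groups being compared
baseline = [p90 for _, ldc, p90 in runs if ldc is None]
with_ldc = [p90 for _, ldc, p90 in runs if ldc is not None]

for label, values in (("baseline", baseline), ("LDC", with_ldc)):
    lo, hi, avg = spread(values)
    print(f"{label:8s} n={len(values)} min={lo:.3f} max={hi:.3f} mean={avg:.3f}")
```

Both the best and the worst 90% figures come from LDC runs, which is the shift I'm describing; more runs per setting would be needed before reading anything into the means.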

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
