Testing 8.3 LDC vs. 8.2.4 with aggressive BGW

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Testing 8.3 LDC vs. 8.2.4 with aggressive BGW
Date: 2007-09-11 05:06:34
Message-ID: Pine.GSO.4.64.0709071630180.10175@westnet.com
Lists: pgsql-hackers

Renaming the old thread to more appropriately address the topic:

On Wed, 5 Sep 2007, Kevin Grittner wrote:

> Then I would test the new background writer with synchronous commits under
> the 8.3 beta, using various settings. The 0.5, 0.7 and 0.9 settings you
> recommended for a test are how far from the LRU end of the cache to look
> for dirty pages to write, correct?

This is alluding to the suggestions I gave at
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00755.php

checkpoint_completion_target has nothing to do with the LRU, so let's step
back to fundamentals and talk about what it actually does. The official
documentation is at
http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html

As you generate transactions, Postgres puts data into the WAL. The WAL is
organized into segments that are typically 16MB each. Periodically, the
system hits a checkpoint where the WAL data up to a certain point is
guaranteed to have been applied to the database, at which point the old
WAL files aren't needed anymore and can be reused. These checkpoints are
generally caused by one of two things happening:

1) checkpoint_segments worth of WAL files have been written
2) more than checkpoint_timeout seconds have passed since the last
checkpoint
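
For reference, those two triggers map onto these postgresql.conf settings
(the values shown here are just the stock defaults, not recommendations):

   checkpoint_segments = 3      # WAL segments between checkpoints, normally 16MB each
   checkpoint_timeout = 5min    # force a checkpoint after this much time regardless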

The system doesn't stop working while the checkpoint is happening; it just
keeps creating new WAL files. As long as each checkpoint finishes before
the next one is required, performance should be fine.

In the 8.2 model, processing the checkpoint occurs as fast as data can be
written to disk. In 8.3, the writes can be spread out instead. What
checkpoint_completion_target does is suggest how far along the system
should aim to have finished the current checkpoint relative to when the
next one is expected.

For example, your current system has checkpoint_segments=10. Assume that
you have checkpoint_timeout set to a large number such that the
checkpoints are typically being driven by the number of segments being
filled (so you get a checkpoint every 10 WAL segments, period). If
checkpoint_completion_target was set to 0.5, the expectation is that the
writes for the currently executing checkpoint would be finished about the
time that 0.5*10=5 segments of new WAL data had been written. If you set
it to 0.9 instead, you'd expect the checkpoint is finishing just about
when the 9th WAL segment is being written out, which is cutting things a
bit tight; somewhere around there is the safe upper limit for that
parameter.
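
In postgresql.conf terms, that example looks like this (8.3 only;
checkpoint_completion_target doesn't exist in 8.2):

   checkpoint_segments = 10
   checkpoint_completion_target = 0.5   # finish writes after ~0.5*10=5 new segments
   #checkpoint_completion_target = 0.9  # ~9 segments; near the safe upper limit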

Now, checkpoint_segments=10 is a pretty low setting, but I'm guessing that
on your current system that's forcing very regular checkpoints, which
makes each individual checkpoint have less work to do and therefore
reduces the impact of the spikes you're trying to avoid. With LDC and
checkpoint_completion_target, you can make that number much bigger (I
suggested 50), which means you'll only have 1/5 as many checkpoints
causing I/O spikes, and each of those checkpoints will have 5X as long to
potentially spread the writes over. The main cost is that it will take
longer to recover if your database crashes, which hopefully is a rare
event.
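
So the sort of change I'm suggesting for the 8.3 runs looks like this
(50 is just the value I suggested before, not a magic number):

   checkpoint_segments = 50             # up from 10: 1/5 as many checkpoints
   checkpoint_completion_target = 0.9   # spread each one over most of its interval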

Having far fewer checkpoints is obviously a win for your situation, but the
open question is whether spreading them out this way will reduce the I/O
spikes as effectively as the all-scan background writer in 8.2 has been
doing for you. This is one aspect that makes your comparison a bit
tricky. It's possible that by increasing the segments enough, you'll
get into a situation where you don't see (m)any of them during your
testing run of 8.3. You should try and collect some data on how regularly
checkpoints are happening during early testing to get an idea if this is a
possibility. The usual approach is to set checkpoint_warning to a really
high number (say 3600) so that you get a harmless note in the logs every
time a checkpoint happens, which will show you how frequently they occur.
It's important to have an idea how many checkpoints to expect during each
test run in order to put together a fair comparison; as you increase
checkpoint_segments, you need to adopt a mindset that asks "how many
sluggish transactions am I seeing per checkpoint?", not how many total per
test run.
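
That monitoring setup is just a single line in postgresql.conf; you then
count how often the resulting note shows up in the logs:

   checkpoint_warning = 3600    # complain whenever checkpoints are <1 hour apart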

I have a backport of some of the pg_stat_bgwriter features added in 8.3
that can be applied to 8.2, which might be helpful for monitoring your test
benchmarking server (it is most certainly *not* suitable for the real one):
http://www.westnet.com/~gsmith/content/postgresql/perfmon82.htm
I put that together specifically to allow easier comparisons of 8.2 and
8.3 in this area.
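
On 8.3, the kind of query I take snapshots of before and after each test
run looks like this (the backport exposes a similar but not necessarily
identical set of columns):

   SELECT checkpoints_timed, checkpoints_req,
          buffers_checkpoint, buffers_clean, buffers_backend
   FROM pg_stat_bgwriter;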

> Are the current shared memory and the 1 GB you suggested enough of a
> spread for these tests? (At several hours per test in order to get
> meaningful results, I don't want to get into too many permutations.)

Having a much larger shared_buffers setting should allow you to keep more
data in memory usefully, which may lead to an overall performance gain due
to improved efficiency. With your current configuration, I would guess
that making the buffer cache bigger would increase the checkpoint spike
problems, where that shouldn't be as much of a problem with 8.3 because of
how the checkpoint can be spread out. The hope here is that by letting
PostgreSQL cache more and avoiding writes of popular buffers except at
checkpoint time, your total I/O will be significantly lower with 8.3
compared to how much an aggressive BGW will write in 8.2. Right now,
you've got a pretty low number of pages that accumulate a high usage
count; that may change if you give the buffer cache a lot more room to
work.
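
As a concrete starting point for those runs, something like this is what
I have in mind (sizing depends on your RAM, so treat the value as a
placeholder):

   shared_buffers = 1GB    # much larger than the typical 8.2-era setting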

> Finally, I would try the new checkpoint techniques, with and without the
> new background writer. Any suggestions on where to set the knobs for
> those runs?

This and your related question about simulating the new LRU behavior by
"turning off the 'all' scan and setting the lru scan percentage to 50% or
more" depend on what final form the LRU background writer ends up in.
Certainly you should consider higher values for the percentage and
maxpages parameters with 8.3 in its current form, because the all scan is
no longer doing the majority of the work. If some form of my JIT BGW patch
gets applied before beta, you'll still want to increase maxpages, but you
won't have to play with the percentage anymore; you might try adjusting
the multiplier setting instead.
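
With the parameter names as they currently stand in the 8.3 beta code,
that translates to something along these lines (values for illustration
only, and subject to change if the JIT patch lands):

   bgwriter_lru_maxpages = 200    # allow many more writes per round than the default of 5
   bgwriter_lru_percent = 10.0    # scan further ahead of the LRU point than the default of 1.0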

> I'm inclined to think that it would be interesting to try the benchmarks
> with the backend writing any dirty page through to the OS at the same
> time they are written to the PostgreSQL cache, as a reference point at
> the opposite extreme from having the cache hold onto dirty pages for as
> long as possible before sharing them with the OS. Do you see any value
> in getting actual numbers for that?

It might be an interesting curiosity to see how this works for you, but
I'm not sure of its value to the community at large. The configuration
trend for larger systems seems to be pretty clear at this point: use
large values for shared_buffers and checkpoint_segments, and minimize
total I/O from the background writer by not writing more than you have to.
That means only considering writes of buffers that are going to be reused
in the near future; everything else gets written out only at checkpoint
time. The fact that you've gotten good results in the past with a
configuration radically different from what's considered normal best
practice, one that works around problems in 8.2, is an interesting data
point. But I don't see any reason to jump from there to expecting that
turning the PostgreSQL cache into what's essentially a write-through one,
the way you describe here, will be helpful in most cases, and I'm not sure
how you would do it anyway.

What I would encourage you to take a look at while you're doing these
experiments is radically lowering the Linux dirty_background_ratio tunable
(perhaps even to 0) to see what that does for you. From what I've seen in
the past, the caching there is more likely to be the root of your problem.
Hopefully LDC will address your issue so that you don't have to adjust
this, because lowering it will reduce efficiency considerably, but it may
be the most straightforward way to get the more timely I/O path you're
obviously looking for.
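
On Linux that's a quick experiment to run (the value here is purely for
illustration; check what you're starting from first):

   cat /proc/sys/vm/dirty_background_ratio    # note the current setting
   echo 1 > /proc/sys/vm/dirty_background_ratio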

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
