Redesigning checkpoint_segments

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Redesigning checkpoint_segments
Date: 2013-06-05 12:16:00
Message-ID: 51AF2C00.5020704@vmware.com
Lists: pgsql-hackers

checkpoint_segments is awkward. From an admin's point of view, it
controls two things:

1. it limits the amount of disk space needed for pg_xlog. (it's a soft
limit, but still)
2. it limits the time required to recover after a crash.

For limiting the disk space needed for pg_xlog, checkpoint_segments is
awkward because it's defined in terms of 16MB segments between
checkpoints. It takes a fair amount of arithmetic to calculate the disk
space required to hold the specified number of segments. The manual
gives the formula: (2 + checkpoint_completion_target) *
checkpoint_segments + 1, which amounts to about 1GB per 20 segments as a
rule of thumb. We shouldn't impose that calculation on the user. It
should be possible to just specify "checkpoint_segments=512MB", and the
system would initiate checkpoints so that the total size of WAL in
pg_xlog stays below 512MB.
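To make the awkwardness concrete, here is the arithmetic the manual currently asks of the user, as a small illustrative sketch (not PostgreSQL code; the function name is invented, the formula and 16MB segment size are from the manual):

```python
SEGMENT_SIZE_MB = 16  # size of one WAL segment file

def pg_xlog_space_mb(checkpoint_segments, checkpoint_completion_target=0.5):
    # The manual's formula for the number of WAL files kept in pg_xlog:
    # (2 + checkpoint_completion_target) * checkpoint_segments + 1
    files = (2 + checkpoint_completion_target) * checkpoint_segments + 1
    return files * SEGMENT_SIZE_MB

# The "about 1GB per 20 segments" rule of thumb, with the default
# checkpoint_completion_target of 0.5:
print(pg_xlog_space_mb(20))  # (2.5 * 20 + 1) * 16 = 816 MB, roughly 1GB
```

Going the other way, from a disk-space budget to a checkpoint_segments setting, requires inverting that formula by hand, which is exactly the burden the proposal wants to remove.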

For limiting the time required to recover after crash,
checkpoint_segments is awkward because it's difficult to calculate how
long recovery will take, given checkpoint_segments=X. A bulk load can
use up segments really fast, and recovery will be fast, while segments
full of random deletions can need a lot of random I/O to replay, and
take a long time. IMO checkpoint_timeout is a much better way to control
that, although it's not perfect either.

A third point is that even if you have 10 GB of disk space reserved for
WAL, you don't want to actually consume all that 10 GB, if it's not
required to run the database smoothly. There are several reasons for
that: backups based on a filesystem-level snapshot are larger than
necessary if there are a lot of preallocated WAL segments; and in a
virtualized or shared system, there might be other VMs or applications
that could make use of the disk space. On the other hand, you don't want
to run out of disk space while writing WAL - that can lead to a PANIC in
the worst case.

In VMware's vPostgres fork, we've changed the way this works, so that
there is a new setting, checkpoint_segments_max that can be set by the
user, but checkpoint_segments is adjusted automatically, on the fly. The
system counts how many segments were consumed during the last checkpoint
cycle, and that becomes the checkpoint_segments setting for the next
cycle. That means that in a system with a steady load, checkpoints are
triggered by checkpoint_timeout, and the effective checkpoint_segments
value converges at the exact number of segments needed for that. That's
simple but very effective. It doesn't behave too well with bursty load,
however; during quiet times, checkpoint_segments is dialed way down, and
when the next burst comes along, you get several checkpoints in quick
succession, until checkpoint_segments is dialed back up again.
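The feedback loop described above can be sketched like this (a simplified model of the behavior described, not vPostgres code; the function and variable names are invented for illustration):

```python
def next_checkpoint_segments(segments_used_last_cycle, checkpoint_segments_max):
    # Adopt last cycle's consumption as the new checkpoint_segments,
    # never exceeding the user-set maximum.
    return min(segments_used_last_cycle, checkpoint_segments_max)

# Steady load: the effective setting converges on actual usage, so
# checkpoints end up triggered by checkpoint_timeout.
setting = next_checkpoint_segments(segments_used_last_cycle=40,
                                   checkpoint_segments_max=100)
print(setting)  # 40

# Bursty load: a quiet cycle dials the setting way down, so the next
# burst hits the (now tiny) limit and triggers checkpoints in quick
# succession until the setting is dialed back up.
setting = next_checkpoint_segments(segments_used_last_cycle=2,
                                   checkpoint_segments_max=100)
print(setting)  # 2
```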

I propose that we do something similar, but not exactly the same. Let's
have a setting, max_wal_size, to control the maximum disk space reserved
for WAL. Once that's reached (or you get close enough, so that there are
still some segments left to consume while the checkpoint runs), a
checkpoint is triggered.

In this proposal, the number of segments preallocated is controlled
separately from max_wal_size, so that you can set max_wal_size high,
without actually consuming that much space in normal operation. It's
just a backstop, to avoid completely filling the disk, if there's a
sudden burst of activity. The number of segments preallocated is
auto-tuned, based on the number of segments used in previous checkpoint
cycles.
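The two pieces of the proposal could be sketched as follows (my illustrative pseudocode, assuming a fixed headroom of a couple of segments and a simple recent-peak heuristic for preallocation; the actual trigger margin and auto-tuning formula are exactly what the post leaves open):

```python
SEGMENT_SIZE_MB = 16  # WAL segment size

def checkpoint_needed(wal_in_use_mb, max_wal_size_mb, headroom_segments=2):
    # Trigger a checkpoint before max_wal_size is actually reached,
    # leaving some segments to consume while the checkpoint runs.
    return wal_in_use_mb >= max_wal_size_mb - headroom_segments * SEGMENT_SIZE_MB

def preallocate_target(segments_used_recent_cycles):
    # One plausible auto-tune: preallocate for the recent peak, so a
    # steady workload never waits on segment creation, while a quiet
    # system does not hold max_wal_size worth of preallocated files.
    return max(segments_used_recent_cycles, default=0)

print(checkpoint_needed(wal_in_use_mb=480, max_wal_size_mb=512))  # True
print(preallocate_target([12, 30, 18]))  # 30
```

Note how max_wal_size acts only as a backstop in this sketch: the preallocation target follows actual usage and stays well below the cap unless the workload demands it.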

I'll write up a patch to do that, but before I do, does anyone disagree
on those tuning principles? How do you typically tune
checkpoint_segments on your servers? If the system was to tune it
automatically, what formula should it use?

- Heikki
