Re: Redesigning checkpoint_segments

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Redesigning checkpoint_segments
Date: 2013-06-05 20:16:05
Message-ID: 51AF9C85.5080700@agliodbs.com
Lists: pgsql-hackers

Heikki,

> We shouldn't impose that calculation on the user. It
> should be possible to just specify "checkpoint_segments=512MB", and the
> system would initiate checkpoints so that the total size of WAL in
> pg_xlog stays below 512MB.

Agreed.

> For limiting the time required to recover after crash,
> checkpoint_segments is awkward because it's difficult to calculate how
> long recovery will take, given checkpoint_segments=X. A bulk load can
> use up segments really fast, and recovery will be fast, while segments
> full of random deletions can need a lot of random I/O to replay, and
> take a long time. IMO checkpoint_timeout is a much better way to control
> that, although it's not perfect either.

This is true, but I don't see that your proposal changes this at all
(for the better or for the worse).

> A third point is that even if you have 10 GB of disk space reserved for
> WAL, you don't want to actually consume all that 10 GB, if it's not
> required to run the database smoothly.

Agreed.

> I propose that we do something similar, but not exactly the same. Let's
> have a setting, max_wal_size, to control the max. disk space reserved
> for WAL. Once that's reached (or you get close enough, so that there are
> still some segments left to consume while the checkpoint runs), a
> checkpoint is triggered.

Refinement of the proposal:

1. max_wal_size is a hard limit.
2. checkpoints are triggered once WAL reaches wal_keep_segments plus
   50% of (max_wal_size - wal_keep_segments), to avoid lockup if a
   checkpoint takes longer than expected.
3. wal_keep_segments is taken out of max_wal_size:
   a. it automatically defaults to 20% of max_wal_size if
      max_wal_senders > 0;
   b. for that reason, we don't allow it to be larger than 80% of
      max_wal_size.
4. preallocated WAL isn't allowed to shrink smaller than
   wal_keep_segments + (max_wal_size * 0.1).

This would mean that I could set my server to:

max_wal_size = 2GB

and ...

* by default, 26 segments (416MB) would be kept for wal_keep_segments;
* the checkpoint target would be 77 segments (1.2GB);
* preallocated WAL would always be at least 39 segments (624MB),
  including wal_keep_segments.

Now, if I had a database with a fairly low transaction rate but wanted
to make sure I could recover from an 8-hour break in replication, I
might bump wal_keep_segments up to 1GB. In that case:

* 64 segments (1GB) would be kept.
* checkpoints would target 96 segments (1.5GB)
* preallocated WAL would always be at least 77 segments (1.2GB)
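
For clarity, here's the arithmetic behind the numbers above, as a
throwaway Python sketch (assuming the standard 16MB segment size and
rounding to whole segments):

    # Sanity check of the numbers above, assuming 16MB WAL segments.
    SEG_MB = 16

    def derived(max_wal_mb, keep_mb):
        max_segs = max_wal_mb // SEG_MB
        keep_segs = round(keep_mb / SEG_MB)
        target = keep_segs + (max_segs - keep_segs) // 2   # rule 2
        prealloc = keep_segs + round(max_segs * 0.1)       # rule 4
        return keep_segs, target, prealloc

    print(derived(2048, 2048 * 0.20))   # -> (26, 77, 39)
    print(derived(2048, 1024))          # -> (64, 96, 77)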

> Hmm, haven't thought about that. I think a better unit to set
> wal_keep_segments in would also be MB, not segments.

Well, the ideal unit from the user's point of view is *time*, not
space. That is, the user wants the master to keep, say, "8 hours of
transaction logs", not some number of megabytes. I don't want to
complicate this proposal by trying to deliver that, though.
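
If we did eventually go there, the conversion itself is trivial; it's
measuring the WAL-generation rate that's the hard part. Something like
this (illustrative only; the server doesn't track such a rate today,
so wal_mb_per_hour is a made-up input):

    import math

    # Illustrative only: turn "keep N hours of WAL" into a segment
    # count. wal_mb_per_hour is a hypothetical measured rate, not
    # a statistic the server actually keeps.
    def keep_segments_for(hours, wal_mb_per_hour, seg_mb=16):
        return math.ceil(hours * wal_mb_per_hour / seg_mb)

    keep_segments_for(8, 128)   # -> 64 segments (1GB), as above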

> In this proposal, the number of segments preallocated is controlled
> separately from max_wal_size, so that you can set max_wal_size high,
> without actually consuming that much space in normal operation. It's
> just a backstop, to avoid completely filling the disk, if there's a
> sudden burst of activity. The number of segments preallocated is
> auto-tuned, based on the number of segments used in previous checkpoint
> cycles.

"based on"; can you give me your algorithmic thinking here? I'm
thinking we should have some calculation of last cycle size and peak
cycle size so that bursty workloads aren't compromised.
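
To make the question concrete, here's the general shape I have in
mind, purely a sketch and not a patch: keep a decaying peak alongside
the last cycle's usage, and preallocate for whichever is larger.

    # Sketch only, not backend code: estimate segments to preallocate
    # from the last checkpoint cycle plus a decaying peak, so a burst
    # raises the estimate immediately but fades out gradually.
    class PreallocEstimator:
        def __init__(self, decay=0.9):
            self.peak = 0.0
            self.decay = decay

        def update(self, segs_used_last_cycle):
            # decay the remembered peak so old bursts eventually fade
            self.peak = max(segs_used_last_cycle, self.peak * self.decay)
            return max(segs_used_last_cycle, int(self.peak))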

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
