Re: Redesigning checkpoint_segments

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Redesigning checkpoint_segments
Date: 2013-06-06 17:24:13
Message-ID: 51B0C5BD.80200@agliodbs.com
Lists: pgsql-hackers


>> Then I suggest we not use exactly that name. I feel quite sure we
>> would get complaints from people if something labeled as "max" was
>> exceeded -- especially if they set that to the actual size of a
>> filesystem dedicated to WAL files.
>
> You're probably right. Any suggestions for a better name?
> wal_size_soft_limit?

"checkpoint_size_limit", or something similar. That is, what you're
defining is:

"this is the size at which we trigger a checkpoint even if
checkpoint_timeout has not been exceeded".
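
In code terms, the trigger would look something like this (a minimal
sketch with invented names, not the actual patch; the point is that
size and timeout are OR'd, whichever comes first):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Checkpoint when either the proposed size limit or the existing
     * checkpoint_timeout is reached, whichever comes first.
     */
    static bool
    checkpoint_due(uint64_t wal_bytes_since_ckpt,
                   double secs_since_ckpt,
                   uint64_t checkpoint_size_limit,
                   double checkpoint_timeout_secs)
    {
        return wal_bytes_since_ckpt >= checkpoint_size_limit ||
               secs_since_ckpt >= checkpoint_timeout_secs;
    }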

However, I think it's worth considering: if we're doing this "sizing
checkpoints based on prior cycles" thing, do we really need a size_limit
*at all* for most users? I can see how a hard limit is useful, but not
how a soft limit is.

Most of our users most of the time don't care how large WAL is as long
as it doesn't exceed disk space. And on most databases, hitting
checkpoint_timeout is more frequent than hitting checkpoint_segments --
at least in my substantial performance-tuning experience. So I think
most users would prefer a setting which essentially says "make WAL as
big as it has to be in order to maximize throughput", and wouldn't worry
about the disk space.

>
> Yeah, something like that :-). I was thinking of letting the estimate
> decrease like a moving average, but react to any increases immediately.
> Same thing we do in bgwriter to track buffer allocations:

Seems reasonable. Given the behavior of xlog, I'd want to adjust the
algorithm so that peak usage over a 24-hour period would affect current
preallocation. That is, if a site regularly has a peak from 2-3pm where
they're using 180 segments/cycle, then they should still be somewhat
higher at 2am than a database which doesn't have that peak. I'm pretty
sure that the bgwriter's moving average operates on much shorter time
scales than that.
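
For what it's worth, here's a rough sketch of what I mean: the
bgwriter-style smoothing (react to increases immediately, decay slowly;
the 16-sample constant matches bgwriter's smoothing_samples) combined
with an hourly peak history, so a daily spike still props up the target
at quiet hours. All names here are illustrative, not proposed code:

    #include <stdint.h>

    #define HOURS_PER_DAY     24
    #define SMOOTHING_SAMPLES 16    /* same constant bgwriter uses */

    static double   smoothed_segs;              /* smoothed segs/cycle */
    static uint32_t hourly_peak[HOURS_PER_DAY]; /* peak per hour slot */
    static int      last_hour = -1;

    static void
    note_cycle(uint32_t segs_used, int hour_of_day)
    {
        /* new hour: forget the value this slot held 24 hours ago */
        if (hour_of_day != last_hour)
        {
            hourly_peak[hour_of_day] = 0;
            last_hour = hour_of_day;
        }
        if (segs_used > hourly_peak[hour_of_day])
            hourly_peak[hour_of_day] = segs_used;

        if ((double) segs_used >= smoothed_segs)
            smoothed_segs = segs_used;          /* jump up immediately */
        else
            smoothed_segs += ((double) segs_used - smoothed_segs)
                             / SMOOTHING_SAMPLES;
    }

    static uint32_t
    prealloc_target(void)
    {
        uint32_t peak = 0;

        for (int h = 0; h < HOURS_PER_DAY; h++)
            if (hourly_peak[h] > peak)
                peak = hourly_peak[h];

        /* never fall below half the 24-hour peak, even at 2am */
        if ((double) peak / 2.0 > smoothed_segs)
            return peak / 2;
        return (uint32_t) smoothed_segs;
    }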

>> Well, the ideal unit from the user's point of view is *time*, not space.
>> That is, the user wants the master to keep, say, "8 hours of
>> transaction logs", not any amount of MB. I don't want to complicate
>> this proposal by trying to deliver that, though.
>
> OTOH, if you specify it in terms of time, then you don't have any limit
> on the amount of disk space required.

Well, the best setup from my perspective as a remote DBA for a lot of
clients would be two-factor:

wal_keep_time: ##hr
wal_keep_size_limit: ##GB

That is, we would try to keep ##hr of WAL around for the standbys,
unless that amount exceeded ##GB (at which point we'd write a warning to
the logs). If max_wal_size was a hard limit, we wouldn't need
wal_keep_size_limit, of course.
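
A self-contained sketch of that two-factor rule (both GUC names are
invented for illustration; neither exists today):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static int      wal_keep_time_hours = 8;
    static uint64_t wal_keep_size_limit = 8ULL << 30;   /* 8 GB */

    /*
     * Keep a segment written at seg_time if it is younger than the
     * time target, unless total retained WAL already exceeds the
     * size limit, in which case warn and recycle early.
     */
    static bool
    keep_wal_segment(time_t seg_time, uint64_t total_wal_bytes,
                     time_t now)
    {
        if (total_wal_bytes > wal_keep_size_limit)
        {
            fprintf(stderr, "WARNING: retained WAL exceeds "
                            "wal_keep_size_limit, recycling early\n");
            return false;       /* the size cap wins over the time goal */
        }
        return difftime(now, seg_time) <= wal_keep_time_hours * 3600.0;
    }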

However, to some degree Andres' work will render all this
wal_keep_segments stuff obsolete by letting the master track what
segment was last consumed by each replica, so I don't think it's worth
pursuing this line of thinking a lot further.

In any case, I'm just pointing out that we need to think of
wal_keep_segments as part of the total WAL size, and not as something
separate, because that's confusing our users.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
