Re: Controlling Load Distributed Checkpoints

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 08:36:53
Message-ID: 4667C3A5.9090008@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Greg Smith wrote:
> On Wed, 6 Jun 2007, Heikki Linnakangas wrote:
>
>> The original patch uses bgwriter_all_max_pages to set the minimum
>> rate. I think we should have a separate variable,
>> checkpoint_write_min_rate, in KB/s, instead.
>
> Completely agreed. There shouldn't be any coupling with the background
> writer parameters, which may be set for a completely different set of
> priorities than the checkpoint has. I have to look at this code again
> to see why it's a min_rate instead of a max, that seems a little weird.

It's a min rate because it never writes slower than that, and it can
write faster if the next checkpoint is due soon enough that at the
minimum rate we wouldn't finish before it's time to start the next one.
(Or to be precise, before the time remaining until the next checkpoint
drops below 100-(checkpoint_write_percent)% of the checkpoint interval.)
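
To illustrate, here's a minimal sketch of that rate decision. The
function name and signature are made up for illustration; this is not
the actual patch code:

#include <math.h>

/*
 * Illustrative sketch only: throttle the write phase to whichever is
 * higher, the configured minimum rate or the rate needed to finish
 * before checkpoint_write_percent of the checkpoint interval has
 * elapsed.
 */
static double
effective_write_rate_kbps(double min_rate_kbps,    /* checkpoint_write_min_rate */
                          double kb_left_to_write, /* dirty buffers still to write */
                          double secs_to_deadline) /* time left in the write phase */
{
    if (secs_to_deadline <= 0.0)
        return HUGE_VAL;    /* deadline already passed: don't throttle */

    return fmax(min_rate_kbps, kb_left_to_write / secs_to_deadline);
}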

>> Nap phase: We should therefore give the delay as a number of seconds
>> instead of as a percentage of checkpoint interval.
>
> Again, the setting here should be completely decoupled from another GUC
> like the interval. My main complaint with the original form of this
> patch was how much it tried to synchronize the process with the interval;
> since I don't even have a system where that value is set to something,
> because it's all segment based instead, that whole idea was incompatible.

checkpoint_segments is taken into account as well as checkpoint_timeout.
I used the term "checkpoint interval" to mean the real interval at which
the checkpoints occur, whether it's because of segments or timeout.

> The original patch tried to spread the load out as evenly as possible
> over the time available. I much prefer thinking in terms of getting it
> done as quickly as possible while trying to bound the I/O storm.

Yeah, the checkpoint_write_min_rate allows you to do that.

So there are two extreme ways you can use LDC:
1. Finish the checkpoint as soon as possible, without disturbing other
activity too much. Set checkpoint_write_percent to a high number, and
set checkpoint_write_min_rate to define "too much".
2. Disturb other activity as little as possible, as long as the
checkpoint finishes in a reasonable time. Set checkpoint_write_min_rate
to a low number, and checkpoint_write_percent to define "reasonable time".

Are both interesting use cases, or is it enough to cater for just one of
them? I think 2 is easier to tune. Defining the minimum rate properly
can be difficult and depends a lot on your hardware and application, but
tuning for use case 2 with a default value of, say, 50% for
checkpoint_write_percent should work pretty well for most people.
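
For example, tuning for use case 2 might look something like this in
postgresql.conf, using the GUC names proposed above (the rate figure is
just an illustration, not a recommendation):

checkpoint_write_percent = 50       # finish within half the checkpoint interval
checkpoint_write_min_rate = 1000    # KB/s; low enough not to disturb other I/O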

In any case, the checkpoint had better finish before it's time to start
another one. Or would you rather delay the next checkpoint, and let the
checkpoint take as long as it takes to finish at the minimum rate?

>> And we don't know how much work an fsync performs. The patch uses the
>> file size as a measure of that, but as we discussed that doesn't
>> necessarily have anything to do with reality. fsyncing a 1GB file with
>> one dirty block isn't any more expensive than fsyncing a file with a
>> single block.
>
> On top of that, if you have a system with a write cache, the time an
> fsync takes can greatly depend on how full it is at the time, which
> there is no way to measure or even model easily.
>
> Is there any way to track how many dirty blocks went into each file
> during the checkpoint write? That's your best bet for guessing how long
> the fsync will take.

I suppose it's possible, but the OS has hopefully started flushing them
to disk almost as soon as we started the writes, so even that isn't a
very good measure.

On a Linux system, one way to model it is that the OS flushes dirty
buffers to disk at the same rate as we write them, but delayed by
dirty_expire_centisecs. That should hold if the writes are spread out
enough. Then the amount of dirty buffers in the OS cache at the end of
the write phase is roughly constant, as long as the write phase lasts
longer than dirty_expire_centisecs. If we take a nap of
dirty_expire_centisecs after the write phase, the fsyncs should be
effectively no-ops, except that they will flush any other writes that
the bgwriter's LRU sweep and other backends performed during the nap.
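
As a back-of-the-envelope illustration of that model (the write rate is
made up; 30 s corresponds to the Linux default dirty_expire_centisecs of
3000):

#include <stdio.h>

/*
 * Toy model from the paragraph above: if the kernel writes our dirty
 * pages back about dirty_expire_centisecs after we dirtied them, the
 * backlog left in the OS cache when the write phase ends is roughly
 * write_rate * delay. Napping for the same delay before the fsyncs
 * lets the kernel drain that backlog first.
 */
int
main(void)
{
    double write_rate_mb_s = 10.0;  /* illustrative checkpoint write rate */
    double dirty_expire_s = 30.0;   /* dirty_expire_centisecs = 3000 */

    printf("backlog at end of write phase: ~%.0f MB\n",
           write_rate_mb_s * dirty_expire_s);
    printf("nap ~%.0f s before fsync so the kernel can drain it\n",
           dirty_expire_s);
    return 0;
}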

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
