Checkpoint throttling issues

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Subject: Checkpoint throttling issues
Date: 2015-10-19 10:10:44
Message-ID: 20151019101044.GD3485@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

working on the checkpoint sorting/flushing patch I noticed a number of
preexisting issues:

1) The progress passed to CheckpointWriteDelay() will often be wrong -
it's calculated as num_written / num_to_write, but num_written is only
incremented if the buffer hasn't since independently been written
out. That's bad because it mean's we'll think we're further and
further behind if there's independent writeout activity.

Simple enough to fix, we gotta split num_written into num_written
(for stats purposes) and num_processed (for progress).

This is pretty much a bug, but I'm a slightly worried about
backpatching a fix because it can have a rather noticeable
behavioural impact.

2) CheckpointWriteDelay()'s handling of config file changes et al is
pretty weird: The config is reloaded during a checkpoing iff it's not
an immediate checkpoint and we're on schedule. I see very little
justification for having the got_SIGHUP block inside that if block.

3) The static pg_usleep(100000L); in CheckpointWriteDelay() is a bad
idea.

On a system with low write activity (< 10 writes sec) this will lead
to checkpoints finishing too early as there'll always be around ~10
writes a second. On slow IO, say a sdcard, that's bad.

On system with a very high write throughput (say 1k+ buffers/sec) the
unconditional 100ms sleep while on schedule will mean we're far
behind after the sleep. This leads to rather bursty IO, and makes it
more likely to run into dirty buffer limits and such. Just reducing
the sleep from 100ms to 10ms leads to significant latency
improvements. On pgbench rate limited to 5k tps:

before:
number of transactions skipped: 70352 (1.408 %)
number of transactions above the 100.0 ms latency limit: 2817 (0.056 %)
latency average: 5.266 ms
latency stddev: 11.723 ms
rate limit schedule lag: avg 3.138 (max 0.000) ms

after:
number of transactions skipped: 41696 (0.834 %)
number of transactions above the 100.0 ms latency limit: 1736 (0.035 %)
latency average: 4.929 ms
latency stddev: 8.929 ms
rate limit schedule lag: avg 2.835 (max 0.000) ms

I think the sleep time should be computed adaptively based on the
number of buffers remaining and the remaining time. There's probably
better formulations, but that seems like an easy enough improvement
and considerably better than now.

4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.

Greetings,

Andres Freund

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2015-10-19 11:35:31 Re: Foreign join pushdown vs EvalPlanQual
Previous Message YUriy Zhuravlev 2015-10-19 09:51:49 Re: Some questions about the array.