Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
Date: 2015-12-14 23:08:43
Message-ID: 566F4BFB.7060802@2ndquadrant.com
Lists: pgsql-hackers

Hi,

I was planning to do some review/testing on this patch, but then I
noticed it was rejected with feedback in 2015-07 and never resubmitted
to another CF. So I won't spend time testing it unless someone shouts
that I should do that anyway. Instead I'll just post some ideas about
how we might improve the patch, because otherwise I'd forget about
them.

On 07/05/2015 09:48 AM, Heikki Linnakangas wrote:
>
> The ideal correction formula f(x), would be such that f(g(X)) = X, where:
>
> X is time, 0 = beginning of checkpoint, 1.0 = targeted end of
> checkpoint (checkpoint_segments), and
>
> g(X) is the amount of WAL generated. 0 = beginning of checkpoint, 1.0
> = targeted end of checkpoint (derived from max_wal_size).
>
> Unfortunately, we don't know the shape of g(X), as that depends on the
> workload. It might be linear, if there is no effect at all from
> full_page_writes. Or it could be a step-function, where every write
> causes a full page write, until all pages have been touched, and after
> that none do (something like an UPDATE without a where-clause might
> cause that). In pgbench-like workloads, it's something like sqrt(x). I
> picked X^1.5 as a reasonable guess. It's close enough to linear that it
> shouldn't hurt too much if g(x) is linear. But it cuts the worst spike
> at the very beginning, if g(x) is more like sqrt(x).
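
As a quick illustration of that correction (standalone toy code, not
PostgreSQL source), the following compares the raw WAL fraction with
the x^1.5-corrected value for a sqrt-shaped and a linear workload:

/*
 * Illustrative only: how the x^1.5 correction dampens a front-loaded
 * WAL curve.  If the workload generates WAL like g(X) = sqrt(X)
 * (FPW-heavy, pgbench-like), the corrected value g(X)^1.5 = X^0.75 is
 * much closer to the ideal X than the raw X^0.5, while a linear
 * workload g(X) = X is only mildly distorted, to X^1.5.
 *
 * Build with: cc demo.c -lm
 */
#include <math.h>
#include <stdio.h>

/* f(x) = x^1.5, the compensation picked above */
static double
corrected(double wal_fraction)
{
	return pow(wal_fraction, 1.5);
}

int
main(void)
{
	printf("time  sqrt-WAL  corrected  linear-WAL  corrected\n");
	for (double X = 0.1; X <= 1.0001; X += 0.1)
	{
		double		g_sqrt = sqrt(X);	/* FPW-heavy workload */
		double		g_lin = X;			/* no FPW effect */

		printf("%4.1f  %8.3f  %9.3f  %10.3f  %9.3f\n",
			   X, g_sqrt, corrected(g_sqrt), g_lin, corrected(g_lin));
	}
	return 0;
}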

Exactly. I think the main "problem" here is that we do mix two types of
WAL records, with quite different characteristics:

(a) full_page_writes - very high volume right after a checkpoint, then
usually dropping to a much lower volume

(b) regular records - about the same volume over time (well, lower
volume right after a checkpoint, as that's when FPWs happen)

We completely ignore this distinction when computing elapsed_xlogs,
because we compute it roughly like this:

elapsed_xlogs = wal_since_checkpoint / CheckPointSegments;

which of course gets confused when we write a lot of WAL right after a
checkpoint because of FPWs. But what if we actually tracked the amount
of WAL produced by FPWs within a checkpoint (which we currently don't,
AFAIK)?
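
For concreteness, here is a rough standalone sketch (made-up names, not
existing PostgreSQL code) of the kind of per-checkpoint bookkeeping
that would need: two counters, reset when a checkpoint starts and
bumped for every WAL record, with the bytes of any attached full-page
images accounted separately. In a real patch the counters would live
in shared memory and be updated from the WAL insertion path.

#include <stdint.h>
#include <stdio.h>

typedef struct CheckpointWalStats
{
	uint64_t	wal_bytes;		/* all WAL bytes since checkpoint start */
	uint64_t	wal_fpw_bytes;	/* bytes attributable to full-page writes */
} CheckpointWalStats;

static CheckpointWalStats ckpt_stats;

/* reset at checkpoint start */
static void
ckpt_stats_reset(void)
{
	ckpt_stats.wal_bytes = 0;
	ckpt_stats.wal_fpw_bytes = 0;
}

/* called (hypothetically) once per WAL record inserted */
static void
ckpt_stats_account(uint64_t rec_bytes, uint64_t fpw_bytes_in_rec)
{
	ckpt_stats.wal_bytes += rec_bytes;
	ckpt_stats.wal_fpw_bytes += fpw_bytes_in_rec;
}

int
main(void)
{
	ckpt_stats_reset();
	ckpt_stats_account(9000, 8192);	/* record carrying a full-page image */
	ckpt_stats_account(200, 0);		/* plain record, no FPW */

	printf("total: %llu bytes, of which FPW: %llu bytes\n",
		   (unsigned long long) ckpt_stats.wal_bytes,
		   (unsigned long long) ckpt_stats.wal_fpw_bytes);
	return 0;
}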

Then we could compute the expected *remaining* amount of WAL to be
produced within the checkpoint interval, and use that to compute a
better progress estimate, like this:

wal_bytes          - WAL produced so far in the current checkpoint (total)
wal_fpw_bytes      - WAL produced so far due to FPWs
prev_wal_bytes     - WAL produced in the previous checkpoint (total)
prev_wal_fpw_bytes - WAL produced in the previous checkpoint due to FPWs

So we know that we should expect about

(prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)
(       regular WAL        ) + (            FPW WAL               )

to be produced by the end of the current checkpoint. I don't have a
clear idea how to transform this into the 'progress' value yet, but I'm
pretty sure tracking the two types of WAL separately is key to a better
solution. The x^1.5 correction is probably a step in the right
direction, but I don't feel particularly confident about the exponent
1.5, which is rather arbitrary.
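
For what it's worth, here is a standalone sketch (again with made-up
names, and deliberately stopping short of defining how the result maps
onto 'progress') that estimates the expected remaining WAL by treating
the regular and the FPW portion of the previous checkpoint separately:

#include <stdint.h>
#include <stdio.h>

/* remaining amount relative to the previous cycle, clamped at zero */
static uint64_t
remaining_or_zero(uint64_t prev_total, uint64_t so_far)
{
	return (prev_total > so_far) ? (prev_total - so_far) : 0;
}

/*
 * prev_wal_bytes / prev_wal_fpw_bytes: totals from the previous checkpoint,
 * wal_bytes / wal_fpw_bytes: amounts produced so far in the current one.
 */
static uint64_t
expected_remaining_wal(uint64_t prev_wal_bytes, uint64_t prev_wal_fpw_bytes,
					   uint64_t wal_bytes, uint64_t wal_fpw_bytes)
{
	uint64_t	remaining_regular;
	uint64_t	remaining_fpw;

	/* ( regular WAL ): compare the non-FPW portions of both cycles */
	remaining_regular = remaining_or_zero(prev_wal_bytes - prev_wal_fpw_bytes,
										  wal_bytes - wal_fpw_bytes);

	/* ( FPW WAL ): compare the full-page-write portions */
	remaining_fpw = remaining_or_zero(prev_wal_fpw_bytes, wal_fpw_bytes);

	return remaining_regular + remaining_fpw;
}

int
main(void)
{
	/* made-up numbers: previous cycle wrote 1024 MB, 600 MB of it FPW */
	uint64_t	prev_total = 1024 * 1024 * 1024ULL;
	uint64_t	prev_fpw = 600 * 1024 * 1024ULL;

	/* current cycle so far: 500 MB, of which 450 MB were FPW images */
	uint64_t	cur_total = 500 * 1024 * 1024ULL;
	uint64_t	cur_fpw = 450 * 1024 * 1024ULL;

	printf("expected remaining WAL: %llu MB\n",
		   (unsigned long long) (expected_remaining_wal(prev_total, prev_fpw,
														cur_total, cur_fpw) /
								 (1024 * 1024)));
	return 0;
}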

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
