Re: Partitioned checkpointing

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Partitioned checkpointing
Date: 2015-09-11 15:54:46
Message-ID: 55F2F946.30403@2ndquadrant.com
Lists: pgsql-hackers

On 09/11/2015 03:56 PM, Simon Riggs wrote:
>
> The idea to do a partial pass through shared buffers and only write a
> fraction of dirty buffers, then fsync them is a good one.
>
> The key point is that we spread out the fsyncs across the whole
> checkpoint period.

I doubt that's really what we want to do, as it defeats one of the
purposes of spread checkpoints. With spread checkpoints, we write the
data to the page cache, and then let the OS actually write it to disk.
This is handled by the kernel, which marks dirty data as expired after
some time (say, 30 seconds) and then flushes it to disk.

The goal is to have everything already written to disk when we call
fsync at the beginning of the next checkpoint, so that the fsyncs are
cheap and don't cause I/O issues.

What you propose (spreading the fsyncs) significantly changes that,
because it cuts the amount of time the OS has for writing the data to
disk in the background to 1/N of what it has now. That's a significant
change, and I'd bet it's for the worse.

>
> I think we should be writing out all buffers for a particular file
> in one pass, then issue one fsync per file. >1 fsyncs per file seems
> a bad idea.
>
> So we'd need logic like this
> 1. Run through shared buffers and analyze the files contained in there
> 2. Assign files to one of N batches so we can make N roughly equal sized
> mini-checkpoints
> 3. Make N passes through shared buffers, writing out files assigned to
> each batch as we go
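
To make the three quoted steps concrete, here is a rough sketch in
Python. It is only an illustration, not PostgreSQL code: the names
assign_batches and mini_checkpoints are made up, a dirty buffer is
modeled as a (file_id, block_no) tuple, and the "roughly equal sized"
split is done with a simple greedy heuristic (largest file goes to the
currently lightest batch), which is an assumption on my part.

```python
import heapq
from collections import Counter


def assign_batches(dirty_buffers, n_batches):
    """Steps 1 and 2: count dirty buffers per file, then split the files
    into n_batches groups of roughly equal size (hypothetical helper)."""
    # Step 1: analyze which files have dirty buffers, and how many.
    per_file = Counter(file_id for file_id, _ in dirty_buffers)

    # Step 2: greedy assignment, largest file first, always into the
    # batch that currently holds the fewest dirty buffers.
    heap = [(0, i) for i in range(n_batches)]
    heapq.heapify(heap)
    batches = [[] for _ in range(n_batches)]
    for file_id, count in sorted(per_file.items(),
                                 key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        batches[idx].append(file_id)
        heapq.heappush(heap, (load + count, idx))
    return batches


def mini_checkpoints(dirty_buffers, n_batches):
    """Step 3: one pass per batch. Yields (writes, files_to_fsync), so
    each file is written in one pass and gets exactly one fsync."""
    dirty = list(dirty_buffers)
    for batch in assign_batches(dirty, n_batches):
        members = set(batch)
        writes = sorted(b for b in dirty if b[0] in members)
        yield writes, sorted(members)
```

Every file lands in exactly one batch, so this keeps the "one fsync
per file" property from the quoted proposal.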

What I think might work better is actually keeping the write/fsync
phases we have now, but instead of postponing the fsyncs until the next
checkpoint, we might spread them out after the writes. So with
target=0.5 we'd do the writes in the first half, then the fsyncs in the
other half. Of course, we should sort the data like you propose, and
issue the fsyncs in the same order (so that the OS has time to write the
data to the devices).
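
A minimal sketch of that schedule, again in Python and purely as an
illustration. The function name plan_checkpoint, the (file_id,
n_dirty_pages) input, and the even spacing of the fsyncs are all
assumptions, not anything that exists in the code today.

```python
def plan_checkpoint(files, interval_s, target=0.5):
    """Return a list of (time_s, action, file_id) events.

    files:      list of (file_id, n_dirty_pages), hypothetical input
    interval_s: time between checkpoint starts, in seconds
    target:     fraction of the interval used for the write phase
    """
    ordered = sorted(files)                  # sort the data by file
    if not ordered:
        return []
    total_pages = max(sum(n for _, n in ordered), 1)
    write_window = interval_s * target       # writes go here
    sync_window = interval_s - write_window  # fsyncs go here

    events = []
    done = 0
    for file_id, n in ordered:
        # A file's writes start once all earlier pages are written.
        events.append((write_window * done / total_pages, "write", file_id))
        done += n
    for i, (file_id, _) in enumerate(ordered):
        # fsyncs follow in the same order, spaced evenly (assumption).
        events.append((write_window + sync_window * i / len(ordered),
                       "fsync", file_id))
    return events
```

For example, plan_checkpoint([("b", 1), ("a", 3)], 300) starts writing
"a" at 0s and "b" at 112.5s, then fsyncs "a" at 150s and "b" at 225s,
so each file's data sits in the page cache for well over the ~30 second
expiry before its fsync arrives.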

I wonder how much the original paper (written in 1996) is effectively
obsoleted by spread checkpoints, but the benchmark results posted by
Horikawa-san suggest there's a possible gain. But perhaps partitioning
the checkpoints is not the best approach?

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
