Re: Controlling Load Distributed Checkpoints

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-11 09:27:30
Message-ID: 466D1582.8080503@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

ITAGAKI Takahiro wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
>> True. On the other hand, if we issue writes in essentially random order,
>> we might fill the kernel buffers with random blocks and the kernel needs
>> to flush them to disk as almost random I/O. If we did the writes in
>> groups, the kernel has a better chance at coalescing them.
>
> If the kernel can treat sequential writes better than random writes,
> is it worth sorting dirty buffers in block order per file at the start
> of checkpoints? Here is the pseudo code:
>
>   buffers_to_be_written =
>       SELECT buf_id, tag FROM BufferDescriptors
>         WHERE (flags & BM_DIRTY) != 0 ORDER BY tag.rnode, tag.blockNum;
>   for { buf_id, tag } in buffers_to_be_written:
>       if BufferDescriptors[buf_id].tag == tag:
>           FlushBuffer(&BufferDescriptors[buf_id])
>
> With this method we can also avoid writing buffers that were newly
> dirtied after the checkpoint started.

That's worth testing, IMO. Probably won't happen for 8.3, though.
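
A minimal sketch of the sorted-write idea quoted above, in Python rather than the backend's C, just to make the control flow concrete. BufferDesc, checkpoint_sorted and the flush callback are hypothetical stand-ins, not PostgreSQL APIs. The extra "still dirty" check is an assumption added here and is not part of the quoted pseudo code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for a buffer tag: (rnode, block_num).
Tag = Tuple[int, int]


@dataclass
class BufferDesc:
    """Hypothetical, simplified buffer descriptor (not the real struct)."""
    tag: Tag
    dirty: bool


def checkpoint_sorted(buffers: List[BufferDesc],
                      flush: Callable[[int], None]) -> int:
    """Flush buffers that were dirty at checkpoint start, in (file, block) order.

    Returns the number of buffers flushed.
    """
    # Snapshot the dirty set once, sorted by tag. Buffers dirtied after
    # this point are not in the snapshot and so are left for later.
    snapshot = sorted(
        (buf.tag, buf_id) for buf_id, buf in enumerate(buffers) if buf.dirty
    )
    written = 0
    for tag, buf_id in snapshot:
        buf = buffers[buf_id]
        # Skip the buffer if it has been reused for a different page since
        # the snapshot. The dirty check is an added assumption: it also
        # skips buffers that something else has already cleaned.
        if buf.tag == tag and buf.dirty:
            flush(buf_id)
            buf.dirty = False
            written += 1
    return written
```

Calling checkpoint_sorted(buffers, flush) with a callback that records buffer ids shows the writes arriving grouped by file and ascending by block number, which is the ordering the kernel would presumably find easiest to coalesce.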

>> I tend to agree that if the goal is to finish the checkpoint as quickly
>> as possible, the current approach is better. In the context of load
>> distributed checkpoints, however, it's unlikely the kernel can do any
>> significant overlapping since we're trickling the writes anyway.
>
> Some kernels or storage subsystems treat all I/Os too fairly, so user
> transactions waiting for reads are blocked by checkpoint writes. That
> behavior is unavoidable, but we can split the writes into small batches.

That's really the heart of our problems. If the kernel had support for
prioritizing the normal backend activity and LRU cleaning over the
checkpoint I/O, we wouldn't need to throttle the I/O ourselves. The
kernel has the best knowledge of what it can and can't do, and how busy
the I/O subsystems are. Recent Linux kernels have some support for read
I/O priorities, but not for writes.

I believe the best long-term solution is to add that support to the
kernel, but it's going to take a long time until that's universally
available, and we have a lot of platforms to support.
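
Until then, throttling in user space along the lines of the "small batches" suggestion quoted above might look roughly like the following Python sketch. write_in_batches, batch_size and pause_s are hypothetical names and values chosen for illustration; nothing here corresponds to an actual PostgreSQL function or setting.

```python
import time
from typing import Callable, Sequence


def write_in_batches(buf_ids: Sequence[int],
                     flush: Callable[[int], None],
                     batch_size: int = 32,
                     pause_s: float = 0.01,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Flush buf_ids in small batches, pausing between batches.

    Returns the number of pauses taken. All parameter names and defaults
    are illustrative assumptions.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pauses = 0
    for start in range(0, len(buf_ids), batch_size):
        for buf_id in buf_ids[start:start + batch_size]:
            flush(buf_id)
        # Give other I/O (for example, backend reads) a chance to run
        # before the next batch. No pause is needed after the last one.
        if start + batch_size < len(buf_ids):
            sleep(pause_s)
            pauses += 1
    return pauses
```

The sleep function is passed in so the pacing can be exercised without real delays. A fixed pause is of course blind to how busy the disks actually are, which is exactly the knowledge only the kernel has.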

>> I'm starting to feel we should give up on smoothing the fsyncs and
>> distribute the writes only, for 8.3. As we get more experience with that
>> and its shortcomings, we can enhance our checkpoints further in 8.4.
>
> I agree with distributing only the writes for 8.3. The new parameters
> it introduces (checkpoint_write_percent and checkpoint_write_min_rate)
> will survive without major changes in the future, but the other
> parameters seem to be volatile.

I'm going to start testing with just distributing the writes. Let's see
how far that gets us.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
