Re: sorted writes for checkpoints

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sorted writes for checkpoints
Date: 2010-10-29 13:17:16
Message-ID: AANLkTimXzBnRkmwBsTAfp0YEdgCLp21WMmEOQO4NdzeM@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 29, 2010 at 2:58 AM, Itagaki Takahiro
<itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Simon's argument in the thread that the todo item points to
>> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
>> basically that we don't know what the best algorithm is yet and benchmarking
>> is a lot of work, so let's just let people do whatever they feel like until
>> we settle on the best approach. I think we need to bite the bullet and do
>> some benchmarking, and commit one carefully vetted patch to the backend.
>
> When I submitted the patch, I tested it on a disk-based RAID-5 machine:
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
> But there were no additional benchmarking reports at that time. We still
> need benchmarking before we re-examine the feature. For example, SSDs and
> SSD RAID were not common at that time, but they may be worth considering now.

There are really two separate things here:

(1) trying to do all the writes to file A before you start doing
writes to file B, and
(2) trying to write out blocks to each file in ascending logical block
number order

I'm much more convinced of the value of #1 than I am of the value of
#2. If we do #1, we can then spread out the checkpoint fsyncs in a
meaningful way without fearing that we'll need to fsync the same file
a second time for the same checkpoint. We've gotten some pretty
specific reports of problems in this area recently, so it seems likely
that there is some value to be had there. On the other hand, #2 is
only a win if sorting the blocks in numerical order causes the OS to
write them in a better order than it would otherwise have done. We've
had recent reports that our block-at-a-time relation extension policy
is leading to severe fragmentation on certain filesystems, in which
case ascending logical block order need not match the on-disk layout,
so I'm a bit skeptical about the value of #2 (though, of course, that
can be overturned if we can collect meaningful evidence).
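
To make the distinction between #1 and #2 concrete, here is a rough
sketch in Python. It is not PostgreSQL code: DirtyBuffer, the file
names, and all three functions are hypothetical, and the fsync count
assumes a simplified policy in which a file is fsynced every time the
write stream moves away from it.

```python
from itertools import groupby
from typing import NamedTuple


class DirtyBuffer(NamedTuple):
    file: str   # hypothetical relation file name
    block: int  # logical block number within that file


def order_by_file(buffers):
    """Ordering #1: keep all writes to one file together.

    sorted() is stable, so the within-file order is left as it was.
    """
    return sorted(buffers, key=lambda b: b.file)


def order_by_file_and_block(buffers):
    """Orderings #1 and #2: group by file, then ascending block number."""
    return sorted(buffers, key=lambda b: (b.file, b.block))


def eager_fsync_count(write_order):
    """Count fsyncs if a file is fsynced whenever the writes leave it.

    groupby() yields one group per run of consecutive writes to the
    same file, so the number of groups is the number of fsyncs.
    """
    return sum(1 for _ in groupby(write_order, key=lambda b: b.file))


# Dirty buffers in arbitrary (for example, buffer pool) order.
dirty = [
    DirtyBuffer("A", 7),
    DirtyBuffer("B", 2),
    DirtyBuffer("A", 3),
    DirtyBuffer("B", 9),
    DirtyBuffer("A", 1),
]

print(eager_fsync_count(dirty))                 # interleaved files
print(eager_fsync_count(order_by_file(dirty)))  # one run per file
print(order_by_file_and_block(dirty))
```

With the interleaved input, the eager policy fsyncs files A and B
repeatedly; after ordering #1 each file forms a single run, so it is
fsynced once. Ordering #2 changes only the sequence within each run,
which helps only if the OS and filesystem benefit from it.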

> I think patching the core directly is enough for the first round of
> testing, and we can decide on the interface based on the results. If one
> algorithm wins in all cases, we could just include it in the core, and
> then extensibility would not be needed.

I agree with this, and with Heikki's remarks also.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
