Re: Controlling Load Distributed Checkpoints

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, tgl(at)sss(dot)pgh(dot)pa(dot)us, Hannu Krosing <hannu(at)skype(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Greg Stark <greg(dot)stark(at)enterprisedb(dot)com>
Subject: Re: Controlling Load Distributed Checkpoints
Date: 2007-06-07 12:23:06
Message-ID: 4667F8AA.4040300@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

Thinking about this whole idea a bit more, it occurred to me that the
current approach of writing everything first and fsyncing everything
afterwards is really a historical artifact of the fact that we used to
flush the pages to disk with a system-wide sync call instead of per-file
fsyncs. That might not be the best way to do things in the new
load-distributed-checkpoint world.

How about interleaving the writes with the fsyncs?

1.
Scan all shared buffers and build a list of all files that have dirty
pages, together with the dirty buffers belonging to each file.

2.
foreach(file in list)
{
    foreach(buffer belonging to file)
    {
        write();
        sleep();    /* to throttle the I/O rate */
    }
    sleep();    /* to give the OS a chance to flush the writes at its
                 * own pace */
    fsync();
}
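
To make that a bit more concrete, here's a standalone C sketch of the
same loop. The DirtyFile struct and the write_buffer() and throttling
helpers are made-up stand-ins, not the real bufmgr or smgr interfaces:

/*
 * Phase 2 of the scheme above: per file, write out the dirty buffers
 * with a throttling pause between writes, then fsync that file before
 * moving on to the next one.
 */
#include <stdio.h>
#include <unistd.h>

#define MAX_BUFFERS_PER_FILE 1024

typedef struct DirtyFile
{
    int fd;                              /* open file descriptor */
    int nbuffers;                        /* dirty buffers in this file */
    int buffer_ids[MAX_BUFFERS_PER_FILE];
} DirtyFile;

/* Made-up helper: write one dirty buffer back to its block in the file. */
static void
write_buffer(DirtyFile *file, int buffer_id)
{
    /* pwrite() the 8 kB page at its block offset -- omitted here */
    (void) file;
    (void) buffer_id;
}

static void
checkpoint_files(DirtyFile *files, int nfiles, long throttle_usec)
{
    for (int i = 0; i < nfiles; i++)
    {
        DirtyFile *file = &files[i];

        for (int j = 0; j < file->nbuffers; j++)
        {
            write_buffer(file, file->buffer_ids[j]);
            usleep(throttle_usec);  /* throttle the I/O rate */
        }

        /* give the OS a chance to flush the writes at its own pace */
        usleep(throttle_usec * 10);

        if (fsync(file->fd) != 0)
            perror("fsync");
    }
}

The sleep between the last write and the fsync is where most of the
tuning would happen: the longer we wait, the less work the fsync itself
has left to do.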

This would spread out the fsyncs in a natural way, making the knob to
control the duration of the sync phase unnecessary.

At some point we'll also need to fsync all files that have been modified
since the last checkpoint but no longer have any dirty buffers in the
buffer cache. It's a reasonable assumption that fsyncing those files
doesn't generate much I/O: since the writes were made some time ago, the
OS has likely flushed them to disk already.
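
The bookkeeping for that could be a small set of pending fsync requests,
remembered as pages are written out between checkpoints. A crude sketch
with made-up names (the real thing would presumably be a hash table
keyed by relfilenode and segment, not a flat array of fds):

/*
 * Track files written to since the last checkpoint, so the ones with
 * no dirty buffers left can still be fsync'd at checkpoint end.
 */
#include <unistd.h>

#define MAX_PENDING 4096

typedef struct PendingFsync
{
    int fd;         /* file written to since the last checkpoint */
    int synced;     /* already fsync'd by the per-file loop above? */
} PendingFsync;

static PendingFsync pending[MAX_PENDING];
static int npending = 0;

/* Call whenever a dirty page is written out between checkpoints. */
static void
remember_fsync_request(int fd)
{
    for (int i = 0; i < npending; i++)
        if (pending[i].fd == fd)
            return;             /* already tracked */
    if (npending < MAX_PENDING)
    {
        pending[npending].fd = fd;
        pending[npending].synced = 0;
        npending++;
    }
}

/* Call at the end of the checkpoint, after the per-file loop. */
static void
fsync_remaining(void)
{
    for (int i = 0; i < npending; i++)
        if (!pending[i].synced)
            (void) fsync(pending[i].fd);    /* likely cheap: writes are old */
    npending = 0;
}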

Doing the first phase of just scanning the buffers to see which ones are
dirty also effectively implements the optimization of not writing buffers
that were dirtied after the checkpoint started. And grouping the writes
per file gives the OS a better chance to combine the physical writes.
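
The grouping falls out of sorting the list gathered in phase 1 by file
and then by block number. Roughly (BufTag here is a simplified stand-in
for the real buffer tag):

#include <stdlib.h>

typedef struct BufTag
{
    unsigned int relfilenode;   /* which relation file */
    unsigned int blocknum;      /* which block within it */
} BufTag;

/* Order dirty buffers by file, then by block within the file. */
static int
buftag_cmp(const void *a, const void *b)
{
    const BufTag *ta = (const BufTag *) a;
    const BufTag *tb = (const BufTag *) b;

    if (ta->relfilenode != tb->relfilenode)
        return (ta->relfilenode < tb->relfilenode) ? -1 : 1;
    if (ta->blocknum != tb->blocknum)
        return (ta->blocknum < tb->blocknum) ? -1 : 1;
    return 0;
}

/* usage: qsort(tags, ntags, sizeof(BufTag), buftag_cmp); */

Sorting by block number within a file also makes the writes mostly
sequential, which should help the OS even further.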

One problem is that the segmentation of relations into 1 GB files is
currently handled at a low level inside md.c, and the buffer manager has
no visibility into it. ISTM that some changes to the smgr interfaces
would be needed for this to work well, though even doing it on a
relation-by-relation basis would be better than the current approach.
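
For illustration, the arithmetic md.c hides is simple; assuming the
standard 8 kB block size (RELSEG_SIZE below mirrors the real
compile-time constant, in blocks per segment):

#include <stdio.h>

#define BLCKSZ      8192
#define RELSEG_SIZE (1024 * 1024 * 1024 / BLCKSZ)   /* 131072 blocks */

int
main(void)
{
    unsigned int blocknum = 200000;     /* a block beyond the first 1 GB */
    unsigned int segno = blocknum / RELSEG_SIZE;
    unsigned int offset = (blocknum % RELSEG_SIZE) * BLCKSZ;

    /* prints: segment 1, byte offset 564658176 */
    printf("segment %u, byte offset %u\n", segno, offset);
    return 0;
}

The buffer manager only sees relation and block number; which "filename.N"
segment a block lands in is decided entirely inside md.c.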

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
