Re: Spread checkpoint sync

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Spread checkpoint sync
Date: 2011-01-31 16:43:01
Message-ID: AANLkTimj0rsfgLsXcSRRNpwNW-_F26_8-CgcDdsiQQj6@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 31, 2011 at 11:29 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> IMHO we should re-consider the patch to sort the writes. Not so much
>> because of the performance gain that gives, but because we can then
>> re-arrange the fsyncs so that you write one file, then fsync it, then
>> write the next file and so on.
>
> Isn't that going to make performance worse not better?  Generally you
> want to give the kernel as much scheduling flexibility as possible,
> which you do by issuing the write as far before the fsync as you can.
> An arrangement like the above removes all cross-file scheduling freedom.
> For example, if two files are on different spindles, you've just
> guaranteed that no I/O overlap is possible.
>
>> That way the time taken by the fsyncs
>> is distributed between the writes,
>
> That sounds like you have an entirely wrong mental model of where the
> cost comes from.  Those times are not independent.

Yeah, Greg Smith made the same point a week or three ago. But it
seems to me that there is potential value in overlapping the write and
sync phases to some degree. For example, if the write phase is spread
over 15 minutes and you have 30 files, then by, say, minute 7, it's
probably OK to flush the file you wrote first. Waiting longer isn't
necessarily going to help - the kernel has probably written what it is
going to write without prodding.
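
To make the timing concrete, here is a rough sketch of such an
overlapped schedule. It is only an illustration: the function name, the
even spacing of writes across files, and the fixed 6.5-minute lag
between finishing a file's writes and fsyncing it are all assumptions
chosen to match the 30-file, 15-minute example, not anything the
checkpoint code actually does.

```python
def overlapped_schedule(n_files, write_phase_min, sync_lag_min):
    """Return (file_index, write_done_min, fsync_min) for each file.

    Assumes writes are spread evenly over the write phase, one file
    after another, and that each file is fsynced a fixed lag after its
    own writes finish instead of after *all* writes finish.
    """
    per_file = write_phase_min / n_files
    schedule = []
    for i in range(n_files):
        write_done = (i + 1) * per_file
        schedule.append((i, write_done, write_done + sync_lag_min))
    return schedule


# Hypothetical numbers from the example: 30 files, 15 minutes of writes.
sched = overlapped_schedule(n_files=30, write_phase_min=15.0, sync_lag_min=6.5)

# Files whose fsync is issued before the write phase has ended.
synced_during_writes = [f for f, _, fsync_at in sched if fsync_at <= 15.0]
```

With these numbers the first file is fsynced at minute 7 and 17 of the
30 fsyncs are issued while later files are still being written; the
last one lands at minute 21.5, so the lag also stretches the tail of
the checkpoint.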

In fact, it might be that on a busy system, you could lose by waiting
*too long* to perform the fsync. The cleaning scan and/or backends
may kick out additional dirty buffers that will now have to get forced
down to disk, even though you don't really care about them (because
they were dirtied after the checkpoint write had already been done).
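
A toy model of that trade-off, purely for illustration: it assumes the
kernel writes back the checkpoint's pages on its own with a fixed
half-life, and that unrelated pages for the same file get dirtied at a
constant rate after the checkpoint write. Both assumptions and all of
the numbers are hypothetical; real writeback behavior is far less
tidy.

```python
def pages_forced_by_fsync(delay_s, checkpoint_pages,
                          writeback_halflife_s, redirty_rate):
    """Estimate how many dirty pages an fsync must push out.

    delay_s: seconds between the checkpoint write and the fsync.
    Assumption: checkpoint pages still dirty decay exponentially as the
    kernel writes them back unprompted.
    Assumption: pages dirtied after the checkpoint write pile up
    linearly and are never written back before the fsync.
    """
    still_dirty = checkpoint_pages * 0.5 ** (delay_s / writeback_halflife_s)
    dirtied_later = redirty_rate * delay_s
    return still_dirty + dirtied_later


# Hypothetical workload: 1000 checkpoint pages, 60 s writeback
# half-life, 1 unrelated page dirtied per second.
for delay in (0, 60, 300, 600):
    print(delay, pages_forced_by_fsync(delay, 1000, 60, 1.0))
```

Under these made-up parameters the fsync gets cheaper as the delay
grows from 0 to a few minutes, then more expensive again by 600
seconds, because by then nearly everything it flushes is pages the
checkpoint did not need.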

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
