Re: checkpoint writeback via sync_file_range

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: checkpoint writeback via sync_file_range
Date: 2012-01-11 13:39:15
Message-ID: CA+TgmobXuvgwNpp3y0vMf6_1n_wDO3SV=DuZC75KM0avEkZ5PA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 10, 2012 at 11:38 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> What you're doing here doesn't care though, and I hadn't considered that
> SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
> docs.  Used this way, it's basically fsync without the wait or guarantee; it
> just tries to push what's already dirty further ahead in the write queue
> than those writes would otherwise be.

Well, my goal was to make sure they got into the write queue rather
than just sitting in memory while the kernel twiddles its thumbs. My
hope is that the kernel is smart enough that, when you put something
under write-out, the kernel writes it out as quickly as it can without
causing too much degradation in foreground activity. If that turns
out to be an incorrect assumption, we'll need a different approach,
but I thought it might be worth trying something simple first and
seeing what happens.

> One idea I was thinking about here was building a little hash table inside
> of the fsync absorb code, tracking how many absorb operations have happened
> for whatever the most popular relation files are.  The idea is that we might
> say "use sync_file_range every time <N> calls for a relation have come in",
> just to keep from ever accumulating too many writes to any one file before
> trying to nudge some of it out of there. The bat that keeps hitting me in
> the head here is that right now, a single fsync might have a full 1GB of
> writes to flush out, perhaps because it extended a table and then wrote more
> than that to it.  And in everything but an SSD or giant SAN cache situation,
> 1GB of I/O is just too much to fsync at a time without the OS choking a
> little on it.

That's not a bad idea, but there's definitely some potential downside:
you might end up reducing write-combining quite significantly if
you keep pushing things out to files when it isn't really needed yet.
I was aiming to only push things out when we're 100% sure that they're
going to have to be fsync'd, and certainly any already-written buffers
that are in the OS cache at the start of a checkpoint fall into that
category. That having been said, experimental evidence is king.

> I'll put this into my testing queue after the upcoming CF starts.

Thanks!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
