Re: Load distributed checkpoint

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: Load distributed checkpoint
Date: 2006-12-22 22:34:15
Message-ID: 200612222234.kBMMYFu01441@momjian.us
Lists: pgsql-hackers pgsql-patches

Gregory Stark wrote:
>
> "Bruce Momjian" <bruce(at)momjian(dot)us> writes:
>
> > I have a new idea. Rather than increasing write activity as we approach
> > checkpoint, I think there is an easier solution. I am very familiar
> > with the BSD kernel, and it seems they have a similar issue in trying to
> > smooth writes:
>
> Just to give a bit of context for this. The traditional mechanism for syncing
> buffers to disk on BSD which this daemon was a replacement for was to simply
> call "sync" every 30s. Compared to that this daemon certainly smooths the I/O
> out over the 30s window...
>
> Linux has a more complex solution to this (of course) which has undergone a
> few generations over time. Older kernels had a user space daemon called
> bdflush which called an undocumented syscall every 5s. More recent ones have a
> kernel thread called pdflush. I think both have various mostly undocumented
> tuning knobs but neither makes any sort of guarantee about the amount of time
> a dirty buffer might live before being synced.
>
> Your thinking is correct but that's already the whole point of bgwriter isn't
> it? To get the buffers out to the kernel early in the checkpoint interval so
> that come checkpoint time they're hopefully already flushed to disk. As long
> as your checkpoint interval is well over 30s only the last 30s (or so, it's a
> bit fuzzier on Linux) should still be at risk of being pending.
>
> I think the main problem with an additional pause in the hopes of getting more
> buffers synced is that during the 30s pause on a busy system there would be a
> continual stream of new dirty buffers being created as bgwriter works and
> other backends need to reuse pages. So when the fsync is eventually called
> there will still be a large amount of i/o to do. Fundamentally the problem is
> that fsync is too blunt an instrument. We only need to fsync the buffers we
> care about, not the entire file.

Well, one idea would be for the bgwriter not to do many write()'s
between the massive checkpoint write()'s and the fsync()'s. That would
cut down on the extra I/O that fsync() would have to do.

The problem I see with making the bgwriter do more writes between
checkpoints is the overhead of those scans, and the overhead of doing
write()'s for buffers that will be dirtied again before the checkpoint.
With the delay-between-stages idea, we don't need to guess how
aggressive the bgwriter needs to be: we can just do the writes, and
wait for a while.
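
To make the delay-between-stages idea concrete, here is a rough sketch
in Python rather than backend C. It is not PostgreSQL code: the function
name staged_checkpoint, the (path, offset, data) layout of dirty_pages,
and the delay_seconds parameter are all made up for illustration, and it
assumes a Unix system where os.pwrite is available.

```python
import os
import time


def staged_checkpoint(dirty_pages, delay_seconds):
    """Write all dirty pages, pause, then fsync each touched file.

    dirty_pages is an iterable of (path, offset, data) tuples; this
    layout is hypothetical and only stands in for the buffer pool.
    Returns the number of distinct files that were fsync'ed.
    """
    fds = {}
    try:
        # Stage 1: hand every dirty page to the kernel with write().
        for path, offset, data in dirty_pages:
            if path not in fds:
                fds[path] = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
            os.pwrite(fds[path], data, offset)

        # Stage 2: wait, so the kernel's own flusher has a chance to
        # push those pages to disk before we force the issue.
        time.sleep(delay_seconds)

        # Stage 3: fsync each file; whatever the kernel has already
        # flushed does not need to be written again here.
        for fd in fds.values():
            os.fsync(fd)
        return len(fds)
    finally:
        for fd in fds.values():
            os.close(fd)
```

The only tunable is the pause between stage 1 and stage 3, which is the
point of the proposal: no guessing about how hard the bgwriter should
work between checkpoints.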

On an idle system, would someone dirty a large file, and watch the disk
I/O to see how long it takes for the writes to reach disk?
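
One possible way to run that experiment, as a sketch only: dirty a file
with plain write() calls, sleep for a chosen delay, and then time the
fsync(). The function name time_fsync_after_delay and its parameters
are invented for this example; the file size and the delays to try are
up to whoever runs it, and the results will depend on the kernel and
its flush settings.

```python
import os
import time


def time_fsync_after_delay(path, size_bytes, delay_seconds, chunk=1 << 20):
    """Dirty size_bytes of a file, wait, then time the fsync().

    Returns (write_seconds, fsync_seconds). A short fsync after a long
    delay suggests the kernel already flushed most of the data itself.
    """
    block = b"\0" * chunk
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Dirty the file: write() only copies into the kernel's cache.
        start = time.monotonic()
        remaining = size_bytes
        while remaining > 0:
            written = os.write(fd, block[:min(chunk, remaining)])
            remaining -= written
        write_seconds = time.monotonic() - start

        # Leave the system alone so background flushing can proceed.
        time.sleep(delay_seconds)

        # Time how much work is left for fsync() to do.
        start = time.monotonic()
        os.fsync(fd)
        fsync_seconds = time.monotonic() - start
        return write_seconds, fsync_seconds
    finally:
        os.close(fd)
```

Calling it with the same size and a series of delays (for example 0, 5,
and 30 seconds) and comparing the fsync times would show roughly how
long the kernel takes to get the data out on its own.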

In what we have now, we are either having the bgwriter do too much I/O
between checkpoints, or guaranteeing an I/O storm during a checkpoint by
doing lots of write()'s and then calling fsync() right away. I don't
see how we are ever going to get that properly tuned.

Would someone code up a patch and test it?

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
