Re: pgsql-server/src backend/storage/buffer/bufmgr ...

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql-server/src backend/storage/buffer/bufmgr ...
Date: 2004-01-26 19:33:03
Message-ID: 200401261933.i0QJX3E23616@candle.pha.pa.us
Lists: pgsql-committers

Tom Lane wrote:
> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
> > So far nobody bothered to make any other proposal how to cause the
> > kernel to actually do some writing at all. A lot of people babble about
> > fsync(), fdatasync() and fadvise and whatnot. A week ago I posted the
> > proposal for this and got exactly zero response.
>
> As I've said before, I think we need to find a way to stop using sync()
> altogether --- we have to move to fsync or O_SYNC and variants. sync
> has simply got the wrong API.
>
> Let me give an example: you write a bunch of stuff and then call sync().
> Suppose the kernel is unable to write some of those blocks --- it gets
> a hard I/O error, or doesn't realize it's out of disk space until the
> write is attempted, or whatever. (I think this is what happened to
> Chris K-L last night.) Is the sync call going to tell you about the
> problem? No, it is not. If you are lucky you will get an error return
> from the next operation you try on a file descriptor associated with the
> failed blocks. But by that time you've probably already written a
> checkpoint record to WAL claiming that those writes were all done
> successfully. Finding out about the failures after the checkpoint is
> completed is too late --- you're screwed, especially if a crash happens
> before you can do anything about it.

If sync fails (that is, the kernel-to-disk write fails), we have a hardware
failure, and we don't pretend to recover from that, though it would be
nice to know sooner so we can exit. One idea I floated was to
open/write/fsync/close a temporary file after the sync, in the hope that
the fsync would complete only after the sync does, because it would sit
at the end of the disk flush queue. Tagged queueing could reorder those
writes, but hopefully the fsync would still catch a disk error before we
recycle the WAL files.
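
To make the open/write/fsync/close idea concrete, here is a rough sketch
in C. It is only an illustration under assumptions: sync_then_probe()
and the probe path are hypothetical names, not anything in the tree, and
a successful return does not prove that the blocks queued by sync()
reached disk, since the drive may reorder the writes.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: call sync(), then push a small probe file through
 * open/write/fsync/close.  Returns 0 on success, -1 with errno set on
 * failure.  Success is only a hint that the disk is still accepting
 * writes; it says nothing definite about the blocks sync() scheduled.
 */
int
sync_then_probe(const char *probe_path)
{
    static const char marker[] = "probe";
    int         fd;
    int         saved_errno;
    ssize_t     n;

    sync();                     /* returns void, so no error to check */

    fd = open(probe_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    n = write(fd, marker, sizeof(marker));
    if (n != (ssize_t) sizeof(marker))
    {
        /* a short write leaves errno unset; assume out of disk space */
        saved_errno = (n < 0) ? errno : ENOSPC;
        goto fail;
    }

    /* unlike sync(), fsync() reports I/O errors through its return value */
    if (fsync(fd) != 0)
    {
        saved_errno = errno;
        goto fail;
    }

    if (close(fd) != 0)
    {
        saved_errno = errno;
        unlink(probe_path);
        errno = saved_errno;
        return -1;
    }

    unlink(probe_path);
    return 0;

fail:
    close(fd);
    unlink(probe_path);
    errno = saved_errno;
    return -1;
}
```

A caller would run this after the checkpoint's sync() and refuse to
recycle WAL files if it returns -1.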

>
> > The whole point of the bgwriter is to give response times a better
> > variance, I never claimed that it will improve performance.
>
> I want to use it to improve reliability, by getting rid of our
> dependence on sync(). The bgwriter can afford to wait for writes
> to occur, so it should be able to use fsync or even O_SYNC.

But I always wonder how to do that while still allowing the kernel and
disk drive to reorder writes, and while keeping good background writer
performance in moving pages out of the buffer cache.
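
One possible shape, sketched in C purely as an illustration and not as
something proposed in this thread: issue ordinary buffered writes for a
batch of pages, then fsync the file once. The kernel and drive stay free
to reorder the writes within the batch, and the single fsync still
returns an error if any of them failed. The function name and
SKETCH_PAGE_SIZE are hypothetical, and whether this keeps up with the
buffer cache is exactly the open question.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 8192   /* assumed page size for the sketch */

/*
 * Hypothetical helper: write npages zero-filled pages with plain write()
 * (no O_SYNC), then fsync once.  Returns 0 on success, -1 with errno set.
 */
int
write_pages_then_fsync(const char *path, int npages)
{
    char        page[SKETCH_PAGE_SIZE];
    int         fd;
    int         i;
    int         saved_errno;
    ssize_t     n;

    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    memset(page, 0, sizeof(page));

    /* buffered writes: the kernel may flush these in any order it likes */
    for (i = 0; i < npages; i++)
    {
        n = write(fd, page, sizeof(page));
        if (n != (ssize_t) sizeof(page))
        {
            /* a short write leaves errno unset; assume out of disk space */
            saved_errno = (n < 0) ? errno : ENOSPC;
            close(fd);
            errno = saved_errno;
            return -1;
        }
    }

    /* one wait for the whole batch, with an error return we can act on */
    if (fsync(fd) != 0)
    {
        saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return -1;
    }

    return close(fd);
}
```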

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
