Re: O_DIRECT in freebsd

From:	Sean Chittenden <sean(at)chittenden(dot)org>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	"Jim C. Nasby" <jim(at)nasby(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: O_DIRECT in freebsd
Date:	2003-06-23 01:12:47
Message-ID:	20030623011247.GI97131@perrin.int.nxad.com
Lists:	pgsql-hackers

> > > Basically, we don't know when we read a buffer whether this is a
> > > read-only or read/write. In fact, we could read it in, and
> > > another backend could write it for us.
> >
> > Um, wait. The cache is shared between backends? I don't think
> > so, but it shouldn't matter because there has to be a semaphore
> > locking the cache to prevent the coherency issue you describe. If
> > PostgreSQL didn't, it'd be having problems with this now. I'd
> > also think that MVCC would handle the case of updated data in the
> > cache as that has to be a common case. At what point is the
> > cached result invalidated and fetched from the OS?
>
> Uh, it's called the _shared_ buffer cache in postgresql.conf, and we
> lock pages only while we are reading/writing them, not for the duration
> they are in the cache.

*smacks forehead* Duh, you're right. I always just turn up the FS
cache in the OS instead.

The shared buffer cache has got to have enormous churn though if
everything ends up in the userland cache. Is it really an exhaustive
cache? I thought the bulk of the caching happened in the kernel and
not in the userland. Is the userland cache just for the SysCache and
friends, or does it cache everything that moves through PostgreSQL?

> > > The big issue is that when we do a write, we don't wait for it
> > > to get to disk.
> >
> > Only in the case when fsync() is turned off, but again, that's up to
> > the OS to manage that can of worms, which I think BSD takes care of
> > that. From conf/NOTES:
>
> Nope. When you don't have a kernel buffer cache, and you do a
> write, where do you expect it to go? I assume it goes to the drive,
> and you have to wait for that.

Correct, a write call blocks until the bits hit the disk when there
isn't enough buffer space. When there is enough buffer space, however,
the kernel returns control to the userland app right away and the
buffer houses the bits until they are written to disk.
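
To make the buffered case concrete, here's a rough sketch in Python
(purely illustrative, not anything from PostgreSQL or the FreeBSD
tree; timed_write is a made-up helper). Without sync, write() only has
to copy into the kernel's buffer; with sync, fsync() holds the caller
until the kernel reports the data as written out.

```python
import os
import tempfile
import time


def timed_write(path, payload, sync):
    """Write payload to path; return (bytes_written, elapsed_seconds).

    With sync=False the call can return as soon as the kernel has
    buffered the data. With sync=True, fsync() blocks until the kernel
    reports that the data has been flushed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        written = os.write(fd, payload)
        if sync:
            os.fsync(fd)
        return written, time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    tmp_fd, tmp_path = tempfile.mkstemp()
    os.close(tmp_fd)
    try:
        for sync in (False, True):
            n, secs = timed_write(tmp_path, b"a" * 8192, sync)
            print(f"sync={sync}: wrote {n} bytes in {secs:.6f}s")
    finally:
        os.unlink(tmp_path)
```

The timings depend entirely on the OS, the filesystem and the drive,
so treat the numbers as a way to poke at the behavior, not a benchmark.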

Consensus is that FreeBSD does the right thing and hands back data
from the FS buffer even though the fd was marked O_DIRECT (see
bottom).
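
If anyone wants to check that on their own box, something like the
following should do it. It's a sketch under assumptions: the 4096-byte
BLOCK size, the helper names and the fallback path are all mine, and
O_DIRECT alignment rules differ between platforms and filesystems. One
descriptor writes through the cache without an fsync, a second one
reads the same block back with O_DIRECT, and the two are compared.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumption: a multiple of the device's logical block size


def read_block(path, extra_flags):
    """Read one BLOCK from the start of path into page-aligned memory."""
    fd = os.open(path, os.O_RDONLY | extra_flags)
    try:
        buf = mmap.mmap(-1, BLOCK)  # anonymous mapping, so it is page-aligned
        try:
            n = os.readv(fd, [buf])
            return buf[:n]  # slicing an mmap copies the data out as bytes
        finally:
            buf.close()
    finally:
        os.close(fd)


def buffered_write_then_direct_read(path):
    """Return (written, read_back, used_direct) for one block."""
    payload = os.urandom(BLOCK)
    wfd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(wfd, payload)  # no fsync: the block may still be dirty in cache
        direct = getattr(os, "O_DIRECT", 0)  # 0 where the platform lacks the flag
        try:
            data, used_direct = read_block(path, direct), bool(direct)
        except OSError:
            # Some filesystems reject O_DIRECT; fall back to a plain read.
            data, used_direct = read_block(path, 0), False
    finally:
        os.close(wfd)
    return payload, data, used_direct


if __name__ == "__main__":
    tmp_fd, tmp_path = tempfile.mkstemp()
    os.close(tmp_fd)
    try:
        written, read_back, used_direct = buffered_write_then_direct_read(tmp_path)
        print("O_DIRECT used:", used_direct)
        print("read matches write:", written == read_back)
    finally:
        os.unlink(tmp_path)
```

A match only says the read was coherent on that particular OS and
filesystem; it says nothing about when the dirty block got flushed.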

> > I don't see how this'd be an issue as buffers populated via a
> > read(), that are updated, and then written out, would occupy a new
> > chunk of disk to satisfy MVCC. Why would we need to mark a buffer
> > as read only and carry around/check its state?
>
> We update the expired flags on the tuple during update/delete.

*nods* Okay, I don't see where the problem would be then with
O_DIRECT. I'm going to ask Dillon about O_DIRECT since he
implemented it, likely for the backplane database that he's writing.
I'll let 'ya know what he says.

-sc

Here's a snip from the conversation I had with someone who has mega
vfs foo in FreeBSD:

17:58 * seanc has a question about O_DIRECT
17:58 <@zb^3> ask
17:59 <@seanc> assume two procs have a file open, one proc writes using
      buffered IO, the other uses O_DIRECT to read from the file, is
      read() smart enough to hand back the data in the buffer that
      hasn't hit the disk yet or will there be syncing issues?
18:00 <@zb^3> O_DIRECT in the incarnation from matt dillon will break shit
18:00 <@zb^3> basically, any data read will be set non-cacheable
18:01 <@zb^3> and you'll experience writes earlier than you should
18:01 <@seanc> zb^3: hrm, I don't want to write to the fd + O_DIRECT though
18:02 <@seanc> zb^3: basically you're saying an O_DIRECT fd doesn't consult the
      FS cache before reading from disk?
18:03 <@zb^3> no, it does
18:03 <@zb^3> but it immediately puts any read blocks on the ass end of the LRU
18:03 <@zb^3> so if you write a block, then read it with O_DIRECT it will get
      written out early :(
18:04 <@seanc> zb^3: ah, got it... it's not a data coherency issue, it's a
      priority issue and O_DIRECT makes writes jump the gun
18:04 <@seanc> got it
18:05 <@seanc> zb^3: is that required in the implementation or is it a bug?
18:06 * seanc is wondering whether or not he should bug dillon about this to
      get things working correctly
18:07 <@zb^3> it's a bug in the implementation
18:08 <@zb^3> to fix it you have to pass flags all the way down into the
      getblk-like layer
18:08 <@zb^3> and dillon was opposed to that
18:09 <@seanc> zb^3: hrm, thx... I'll go bug him about it now and see what's up
      in backplane land
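
The priority issue described in that log can be modeled with a toy
LRU. This is a sketch only: ToyBufferCache and first_evicted are
hypothetical names, and nothing here resembles the actual FreeBSD
buffer cache code. A direct read moves the block to the eviction end
of the list, so if the block is dirty it is the first thing flushed
once the cache needs space.

```python
from collections import OrderedDict


class ToyBufferCache:
    """Toy LRU: the first item in `blocks` is the next one evicted."""

    def __init__(self):
        self.blocks = OrderedDict()  # block number -> dirty flag
        self.flushed = []            # dirty blocks written out on eviction

    def write(self, blkno):
        # A write dirties the block and makes it most recently used.
        self.blocks[blkno] = True
        self.blocks.move_to_end(blkno)

    def read(self, blkno, direct=False):
        self.blocks.setdefault(blkno, False)
        # A normal read makes the block most recently used. A "direct"
        # read, as described in the log, puts it at the eviction end.
        self.blocks.move_to_end(blkno, last=not direct)

    def evict_one(self):
        blkno, dirty = self.blocks.popitem(last=False)
        if dirty:
            self.flushed.append(blkno)  # eviction forces the write-out
        return blkno


def first_evicted(direct):
    """Return (evicted block, flushed list) after one eviction."""
    cache = ToyBufferCache()
    cache.read(1)                 # clean block, oldest entry
    cache.write(2)                # dirty block, newest entry
    cache.read(2, direct=direct)  # re-read the dirty block
    return cache.evict_one(), cache.flushed


if __name__ == "__main__":
    print("normal read:", first_evicted(False))
    print("direct read:", first_evicted(True))
```

With a normal read the old clean block goes first and nothing gets
flushed; with the direct read the freshly written block is evicted,
and therefore written out, ahead of it.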

--
Sean Chittenden

In response to

Re: O_DIRECT in freebsd at 2003-06-23 00:42:45 from Bruce Momjian
