adding support for posix_fadvise()

From:	Neil Conway <neilc(at)samurai(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	adding support for posix_fadvise()
Date:	2003-11-03 06:07:45
Message-ID:	1067839664.3089.173.camel@tokyo
Lists:	pgsql-hackers

A couple days ago, Manfred Spraul mentioned the posix_fadvise() API on
-hackers:

http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html

I'm working on making use of posix_fadvise() where appropriate. I can
think of the following places where this would be useful:

(1) As Manfred originally noted, when we advance to a new XLOG segment,
we can use POSIX_FADV_DONTNEED to let the kernel know we won't be
accessing the old WAL segment anymore. I've attached a quick kludge of a
patch that implements this. I haven't done any benchmarking of it yet,
though (comments or benchmark results are welcome).

(2) ISTM that we can set POSIX_FADV_RANDOM for *all* indexes, since the
vast majority of the accesses to them shouldn't be sequential. Are there
any situations in which this assumption doesn't hold? (Perhaps B+-tree
bulk loading, or CLUSTER?) Should this be done per-index-AM, or
globally?

(3) When doing VACUUM, ANALYZE, or large sequential scans (for some
reasonable definition of "large"), we can use POSIX_FADV_SEQUENTIAL.

(4) Various other components, such as tuplestore, tuplesort, and any
utility commands that need to scan through an entire user relation for
some reason. Once we've got the APIs for doing this worked out, it
should be relatively easy to add other uses of posix_fadvise().

(5) I'm hesitant to make use of POSIX_FADV_DONTNEED in VACUUM, as has
been suggested elsewhere. The problem is that it's all-or-nothing: if
the VACUUM happens to look at hot pages, these will be flushed from the
page cache, so the net result may be a loss.
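
To make uses (1)-(3) concrete, here is a minimal sketch of what the
calls might look like. This is not the attached patch: AccessPattern,
advice_for() and advise_whole_file() are hypothetical names invented for
illustration, and only posix_fadvise() and the POSIX_FADV_* constants
come from the standard.

```c
#define _XOPEN_SOURCE 600   /* ask libc to declare posix_fadvise() */
#include <fcntl.h>

/* Hypothetical labels for the access patterns listed above. */
typedef enum
{
    ACCESS_OLD_WAL_SEGMENT,     /* (1) segment we have just finished with */
    ACCESS_INDEX,               /* (2) mostly random page fetches */
    ACCESS_SEQ_SCAN             /* (3) VACUUM, ANALYZE, large seqscans */
} AccessPattern;

/* Map an access pattern to the POSIX advice value proposed for it. */
static int
advice_for(AccessPattern pattern)
{
    switch (pattern)
    {
        case ACCESS_OLD_WAL_SEGMENT:
            return POSIX_FADV_DONTNEED;
        case ACCESS_INDEX:
            return POSIX_FADV_RANDOM;
        case ACCESS_SEQ_SCAN:
            return POSIX_FADV_SEQUENTIAL;
    }
    return POSIX_FADV_NORMAL;   /* unknown pattern: leave the default */
}

/*
 * Advise the kernel about the whole file behind fd.  offset = 0 and
 * len = 0 together mean "from the start to the end of the file".
 * Returns 0 on success; on failure the error number is the return
 * value (not -1).
 */
static int
advise_whole_file(int fd, AccessPattern pattern)
{
    return posix_fadvise(fd, 0, 0, advice_for(pattern));
}
```

Since the call is only a hint, a caller would presumably ignore or at
most log a nonzero return rather than treat it as an error.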

So what API is desirable for uses 2-4? I'm thinking of adding a new
function to the smgr API, smgradvise(). Given a Relation and an advice,
this would:

(a) propagate the advice for this relation to all the open FDs for the
relation

(b) store the new advice somewhere so that new FDs for the relation can
have this advice set for them: clients should just be able to call
smgradvise() without needing to worry if someone else has already called
smgropen() for the relation in the past. One problem is how to store
this: I don't think it can be a field of RelationData, since that is
transient. Any suggestions?
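
As a strawman for (a) and (b), the bookkeeping could look roughly like
the sketch below. Everything here is an assumption made for
illustration: the RelAdvice struct, the fixed-size per-process table,
the unsigned int relation id, and the sketch_* function names are all
hypothetical, and none of it is existing smgr code. It also sidesteps
the storage question by using a static table.

```c
#define _XOPEN_SOURCE 600   /* ask libc to declare posix_fadvise() */
#include <fcntl.h>
#include <stddef.h>

#define MAX_RELS        64  /* arbitrary limits for the sketch */
#define MAX_FDS_PER_REL 8

/* Hypothetical entry: remembered advice plus the FDs open right now. */
typedef struct RelAdvice
{
    unsigned int relid;     /* stand-in for a relation identifier */
    int          advice;    /* last advice requested for the relation */
    int          fds[MAX_FDS_PER_REL];
    int          nfds;
    int          in_use;
} RelAdvice;

static RelAdvice advice_table[MAX_RELS];

/* Find the entry for relid, creating it with POSIX_FADV_NORMAL if absent. */
static RelAdvice *
lookup_rel(unsigned int relid)
{
    RelAdvice  *free_slot = NULL;
    int         i;

    for (i = 0; i < MAX_RELS; i++)
    {
        if (advice_table[i].in_use && advice_table[i].relid == relid)
            return &advice_table[i];
        if (!advice_table[i].in_use && free_slot == NULL)
            free_slot = &advice_table[i];
    }
    if (free_slot != NULL)
    {
        free_slot->relid = relid;
        free_slot->advice = POSIX_FADV_NORMAL;
        free_slot->nfds = 0;
        free_slot->in_use = 1;
    }
    return free_slot;       /* NULL when the table is full */
}

/* (a) push the advice to every open FD; (b) remember it for later FDs. */
static int
sketch_smgradvise(unsigned int relid, int advice)
{
    RelAdvice  *rel = lookup_rel(relid);
    int         i;

    if (rel == NULL)
        return -1;
    rel->advice = advice;
    for (i = 0; i < rel->nfds; i++)
        (void) posix_fadvise(rel->fds[i], 0, 0, advice);   /* hint only */
    return 0;
}

/* To be called when a new FD is opened: apply whatever was remembered. */
static int
sketch_register_fd(unsigned int relid, int fd)
{
    RelAdvice  *rel = lookup_rel(relid);

    if (rel == NULL || rel->nfds >= MAX_FDS_PER_REL)
        return -1;
    rel->fds[rel->nfds++] = fd;
    return posix_fadvise(fd, 0, 0, rel->advice);
}
```

With this shape, a caller of sketch_smgradvise() does not need to know
whether any FD is already open: existing FDs are advised immediately and
later ones pick the advice up in sketch_register_fd().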

Note that I'm assuming that we don't need to set advice on sub-sections
of a relation, although the posix_fadvise() API allows it -- does anyone
think that would be useful?
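
For reference, advising a sub-section would just mean passing a nonzero
offset and length. A minimal sketch, assuming the default 8192-byte
block size; SKETCH_BLCKSZ, block_offset() and advise_block_range() are
hypothetical names, not existing code:

```c
#define _XOPEN_SOURCE 600   /* ask libc to declare posix_fadvise() */
#include <fcntl.h>
#include <sys/types.h>

#define SKETCH_BLCKSZ 8192  /* assumed block size for the sketch */

/* Byte offset at which block number blkno starts. */
static off_t
block_offset(unsigned int blkno)
{
    return (off_t) blkno * SKETCH_BLCKSZ;
}

/*
 * Apply advice to nblocks blocks starting at first_block.  Note that
 * nblocks = 0 yields len = 0, which posix_fadvise() interprets as
 * "everything from the offset to the end of the file".
 */
static int
advise_block_range(int fd, unsigned int first_block,
                   unsigned int nblocks, int advice)
{
    return posix_fadvise(fd, block_offset(first_block),
                         block_offset(nblocks), advice);
}
```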

One potential issue is that when one process calls posix_fadvise() on a
particular FD, I'd expect that other processes accessing the same file
will be affected. For example, enabling FADV_SEQUENTIAL while we're
vacuuming a relation will mean that another client doing a concurrent
SELECT on the relation will see different readahead behavior. That
doesn't seem like a major problem though.

BTW, posix_fadvise() is currently only supported on Linux 2.6 w/ a
recent version of glibc (BSD hackers, if you're listening,
posix_fadvise() would be a very cool thing to have :P). So we'll need to
do the appropriate configure magic to ensure we only use it where it's
available. Thankfully, it is a POSIX standard, so I would expect that in
the years to come it will be available on more platforms.
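
On the C side, the guard might look something like the sketch below. It
assumes configure defines HAVE_POSIX_FADVISE when the function is found
(the usual autoconf naming, but an assumption here), and
drop_cache_hint() is a hypothetical helper name:

```c
#define _XOPEN_SOURCE 600   /* ask libc to declare posix_fadvise() */
#include <fcntl.h>

/*
 * Tell the kernel we are done with the file behind fd, if the platform
 * can express that.  Where posix_fadvise() or POSIX_FADV_DONTNEED is
 * missing this compiles to a no-op that reports success.
 */
static int
drop_cache_hint(int fd)
{
#if defined(HAVE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#else
    (void) fd;              /* unused on platforms without the call */
    return 0;
#endif
}
```

Checking for the constant as well as the function covers platforms that
declare one without the other.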

Any comments would be welcome.

-Neil
