Re: pgsql-server/ /configure /configure.in rc/incl ...

From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, pgsql-committers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: pgsql-server/ /configure /configure.in rc/incl ...
Date: 2003-03-07 00:36:40
Message-ID: 20030307003640.GF79234@perrin.int.nxad.com
Lists: pgsql-committers pgsql-performance

[moving to -performance, please drop -committers from replies]

> > I've toyed with the idea of adding this because it is monstrously more
> > efficient than select()/poll() in basically every way, shape, and
> > form.
>
> From what I've looked at, kqueue only wins when you are watching a
> large number of file descriptors at the same time; which is an
> operation done nowhere in Postgres. I think the above would be a
> complete waste of effort.

It scales very well to many thousands of descriptors, but it also
works well with small numbers. kqueue is about 5x faster than
select() or poll() at the low end of the fd count. As I said
earlier, I don't think there is _much_ to gain in this regard; it
would be a speed improvement, but only on one OS supported by
PostgreSQL. I think that there are bigger speed improvements to be
had elsewhere in the code.

> > Is this one of the areas of PostgreSQL that just needs to get
> > slowly migrated to use mmap() or are there any gaping reasons why
> > to not use the family of system calls?
>
> There has been much speculation on this, and no proof that it
> actually buys us anything to justify the portability hit.

Actually, I think that it wouldn't be that big of a portability hit
because you still would read() and write() as always, but in
performance sensitive areas, an #ifdef HAVE_MMAP section would have
the appropriate mmap() calls. If the system doesn't have mmap(),
there isn't much to lose and we're in the same position we're in now.
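
To make the #ifdef idea concrete, here is a minimal sketch of what such a
section might look like. read_block() is a hypothetical helper, not
anything in the PostgreSQL tree, and HAVE_MMAP is assumed to come from a
configure check (defined by hand here so the sketch builds standalone):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: normally set by configure; defined here for a standalone build. */
#ifndef HAVE_MMAP
#define HAVE_MMAP 1
#endif

#ifdef HAVE_MMAP
#include <sys/mman.h>
#endif

/*
 * Hypothetical helper: copy len bytes starting at offset off of file fd
 * into buf. off must be page-aligned and [off, off + len) must lie inside
 * the file. Returns 0 on success, -1 on failure.
 */
static int
read_block(int fd, off_t off, void *buf, size_t len)
{
#ifdef HAVE_MMAP
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

    if (p != MAP_FAILED)
    {
        memcpy(buf, p, len);
        munmap(p, len);
        return 0;
    }
    /* mapping failed: fall through to the portable path */
#endif
    if (lseek(fd, off, SEEK_SET) == (off_t) -1)
        return -1;

    ssize_t n = read(fd, buf, len);

    return (n == (ssize_t) len) ? 0 : -1;
}
```

Callers see the same interface either way, which is the point: without
HAVE_MMAP the function is just today's lseek()/read() path.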

> There would be some nontrivial problems to solve, such as the
> mechanics of accessing a large number of files from a large number
> of backends without running out of virtual memory. Also, is it
> guaranteed that multiple backends mmap'ing the same block will
> access the very same physical buffer, and not multiple copies?
> Multiple copies would be fatal. See the archives for more
> discussion.

Have read through the archives. Making a call to madvise() will speed
up access to the pages as it gives hints to the VM about what order
the pages are accessed/used. Here are a few bits from the BSD mmap()
and madvise() man pages:

mmap(2):
MAP_NOSYNC Causes data dirtied via this VM map to be flushed to
physical media only when necessary (usually by the pager) rather
than gratuitously. Typically this prevents the update daemons from
flushing pages dirtied through such maps and thus allows efficient
sharing of memory across unassociated processes using a file-backed
shared memory map. Without this option any VM pages you dirty may be
flushed to disk every so often (every 30-60 seconds usually) which
can create performance problems if you do not need that to occur
(such as when you are using shared file-backed mmap regions for IPC
purposes). Note that VM/filesystem coherency is maintained whether
you use MAP_NOSYNC or not. This option is not portable across UNIX
platforms (yet), though some may implement the same behavior by
default.

WARNING! Extending a file with ftruncate(2), thus creating a big
hole, and then filling the hole by modifying a shared mmap() can lead
to severe file fragmentation. In order to avoid such fragmentation
you should always pre-allocate the file's backing store by write()ing
zeros into the newly extended area prior to modifying the area via
your mmap(). The fragmentation problem is especially sensitive to
MAP_NOSYNC pages, because pages may be flushed to disk in a totally
random order.

The same applies when using MAP_NOSYNC to implement a file-based
shared memory store. It is recommended that you create the backing
store by write()ing zeros to the backing file rather than
ftruncate()ing it. You can test file fragmentation by observing the
KB/t (kilobytes per transfer) results from an ``iostat 1'' while
reading a large file sequentially, e.g. using
``dd if=filename of=/dev/null bs=32k''.

The fsync(2) function will flush all dirty data and metadata
associated with a file, including dirty NOSYNC VM data, to physical
media. The sync(8) command and sync(2) system call generally do not
flush dirty NOSYNC VM data. The msync(2) system call is obsolete
since BSD implements a coherent filesystem buffer cache. However, it
may be used to associate dirty VM pages with filesystem buffers and
thus cause them to be flushed to physical media sooner rather than
later.

madvise(2):
MADV_NORMAL Tells the system to revert to the default paging behavior.

MADV_RANDOM Is a hint that pages will be accessed randomly, and
prefetching is likely not advantageous.

MADV_SEQUENTIAL Causes the VM system to depress the priority of pages
immediately preceding a given page when it is faulted
in.

mprotect(2):
The mprotect() system call changes the specified pages to have protection
prot. Not all implementations will guarantee protection on a page basis;
the granularity of protection changes may be as large as an entire
region. A region is the virtual address space defined by the start and
end addresses of a struct vm_map_entry.

Currently these protection bits are known, which can be combined, OR'd
together:

PROT_NONE No permissions at all.

PROT_READ The pages can be read.

PROT_WRITE The pages can be written.

PROT_EXEC The pages can be executed.

msync(2):
The msync() system call writes any modified pages back to the filesystem
and updates the file modification time. If len is 0, all modified pages
within the region containing addr will be flushed; if len is non-zero,
only those pages containing addr and len-1 succeeding locations will be
examined. The flags argument may be specified as follows:

MS_ASYNC Return immediately
MS_SYNC Perform synchronous writes
MS_INVALIDATE Invalidate all cached data
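
For reference, a small sketch of how a caller might flush part of a
mapping with msync(). flush_range() is a hypothetical name. The
rounding down to a page boundary is there because some systems (Linux,
for one) reject an addr that isn't page-aligned:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush the pages covering [addr, addr + len) of a
 * shared mapping. wait_for_io selects MS_SYNC (block until written) or
 * MS_ASYNC (schedule the write and return). Returns 0 or -1.
 */
static int
flush_range(void *addr, size_t len, int wait_for_io)
{
    long pagesz = sysconf(_SC_PAGESIZE);

    if (pagesz <= 0)
        return -1;

    /* round the start address down to the enclosing page boundary */
    uintptr_t start = (uintptr_t) addr & ~((uintptr_t) pagesz - 1);
    size_t    span = len + (size_t) ((uintptr_t) addr - start);

    return msync((void *) start, span, wait_for_io ? MS_SYNC : MS_ASYNC);
}
```

Something like flush_range(block, BLCKSZ, 1) would be the synchronous
case; a "sync" setting could pick between the two flags.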

A few thoughts come to mind:

1) backends could share buffers by mmap()'ing shared regions of data.
While I haven't seen any numbers to reflect this, I'd wager that
mmap() is a faster interface than ipc.

2) It looks like while there are various file IO schemes scattered all
over the place, the bulk of the critical routines that would need
to be updated are in backend/storage/file/fd.c, more specifically:

*) fileNameOpenFile() would need the appropriate mmap() call added
to it.

*) FileTruncate() would need some attention to avoid fragmentation.

*) a new "sync" GUC would have to be introduced to handle msync
(affects only pg_fsync() and pg_fdatasync()).

3) There's a bit of code in pgsql/src/backend/storage/smgr that could
be gutted/removed. Which of those storage types are even used any
more? There's a reference in the code to PostgreSQL 3.0. :)
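
On point 1, here is a small test sketch of two processes each
mmap()'ing the same file with MAP_SHARED and checking that a store made
through one mapping shows up in the other. mapping_is_shared() is a
hypothetical name, and a pass on one machine says nothing about every
platform PostgreSQL supports, so treat it as an experiment rather than
proof:

```c
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical check: fd is a file of at least len bytes opened O_RDWR.
 * The parent maps it, then a child creates its own MAP_SHARED mapping of
 * the same file and stores a byte. Returns 1 if the parent sees that
 * byte through its mapping, 0 if not, -1 on error.
 */
static int
mapping_is_shared(int fd, size_t len)
{
    char *mine = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int   status;
    int   seen;
    pid_t pid;

    if (mine == MAP_FAILED)
        return -1;
    mine[0] = 'a';

    pid = fork();
    if (pid < 0)
    {
        munmap(mine, len);
        return -1;
    }
    if (pid == 0)
    {
        /* child: build a separate mapping instead of reusing the inherited one */
        char *theirs = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);

        if (theirs == MAP_FAILED)
            _exit(1);
        theirs[0] = 'b';
        _exit(0);
    }

    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
    {
        munmap(mine, len);
        return -1;
    }

    seen = (mine[0] == 'b');
    munmap(mine, len);
    return seen;
}
```

A return of 1 only shows that both mappings reach the same data here;
it doesn't measure whether this beats the existing shared memory path.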

And I think that'd be it. The LRU code could be used if necessary to
help manage the amount of data mmap()'ed into the VM at any one time;
at the very least, that could be handled by a shm var that the various
backends would increment/decrement as files are open()'ed/close()'ed.

I didn't spend too long looking at this, but I _think_ that'd cover
80% of PostgreSQL's disk access needs. The next bit to possibly add
would be passing a flag on FileOpen operations that'd act as a hint to
madvise(), so that the VM could proactively react to PostgreSQL's
needs.

I don't have my copy of Stevens handy (it's some 700mi away atm,
otherwise I'd cite it), but if Tom or someone else has it handy, look
up the example re: the performance gain from read()'ing an mmap()'ed
file versus a non-mmap()'ed file. The difference is non-trivial and
_WELL_ worth the time given the speed increase. The same speed
benefit held true for writes as well, iirc. It's been a while, but I
think it was around page 330. The index has it listed and it's not
that hard of an example to find. -sc

--
Sean Chittenden
