Linux 2.6.6 also

From: Gregory Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Linux 2.6.6 also
Date: 2004-05-10 15:26:41
Message-ID: 874qqos3pq.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


This patch also looks relevant to Postgres for two reasons.

This part seems like it might expose some bugs that otherwise might have
remained hidden:

This affects I/O scheduling potentially quite significantly. It is no
longer the case that the kernel will submit pages for I/O in the order in
which the application dirtied them. We instead submit them in file-offset
order all the time.

The part about part-file fdatasync calls seems like could be really useful.
It seems like that's just speculation about future directions though?

<akpm(at)osdl(dot)org>
[PATCH] make the pagecache lock irq-safe.

Intro to these patches:

- Major surgery against the pagecache, radix-tree and writeback code. This
work is to address the O_DIRECT-vs-buffered data exposure horrors which
we've been struggling with for months.

As a side-effect, 32 bytes are saved from struct inode and eight bytes
are removed from struct page. At a cost of approximately 2.5 bits per page
in the radix tree nodes on 4k pagesize, assuming the pagecache is densely
populated. Not all pages are pagecache; other pages gain the full 8 byte
saving.

This change will break any arch code which is using page->list and will
also break any arch code which is using page->lru of memory which was
obtained from slab.

The basic problem which we (mainly Daniel McNeil) have been struggling
with is in getting a really reliable fsync() across the page lists while
other processes are performing writeback against the same file. It's like
juggling four bars of wet soap with your eyes shut while someone is
whacking you with a baseball bat. Daniel pretty much has the problem
plugged but I suspect that's just because we don't have testcases to
trigger the remaining problems. The complexity and additional locking
which those patches add is worrisome.

So the approach taken here is to remove the page lists altogether and
replace the list-based writeback and wait operations with in-order
radix-tree walks.

The radix-tree code has been enhanced to support "tagging" of pages, for
later searches for pages which have a particular tag set. This means that
we can ask the radix tree code "find me the next 16 dirty pages starting at
pagecache index N" and it will do that in O(log64(N)) time.

This affects I/O scheduling potentially quite significantly. It is no
longer the case that the kernel will submit pages for I/O in the order in
which the application dirtied them. We instead submit them in file-offset
order all the time.

This is likely to be advantageous when applications are seeking all over
a large file randomly writing small amounts of data. I haven't performed
much benchmarking, but tiobench random write throughput seems to be
increased by 30%. Other tests appear to be unaltered. dbench may have got
10-20% quicker, but it's variable.

There is one large file which everyone seeks all over randomly writing
small amounts of data: the blockdev mapping which caches filesystem
metadata. The kernel's IO submission patterns for this are now ideal.


Because writeback and wait-for-writeback use a tree walk instead of a
list walk they are no longer livelockable. This probably means that we no
longer need to hold i_sem across O_SYNC writes and perhaps fsync() and
fdatasync(). This may be beneficial for databases: multiple processes
writing and syncing different parts of the same file at the same time can
now all submit and wait upon writes to just their own little bit of the
file, so we can get a lot more data into the queues.

It is trivial to implement a part-file-fdatasync() as well, so
applications can say "sync the file from byte N to byte M", and multiple
applications can do this concurrently. This is easy for ext2 filesystems,
but probably needs lots of work for data-journalled filesystems and XFS and
it probably doesn't offer much benefit over an i_semless O_SYNC write.


These patches can end up making ext3 (even) slower:

for i in 1 2 3 4
do
dd if=/dev/zero of=$i bs=1M count=2000 &
done

runs awfully slow on SMP. This is, yet again, because all the file
blocks are jumbled up and the per-file linear writeout causes tons of
seeking. The above test runs sweetly on UP because the on UP we don't
allocate blocks to different files in parallel.

Mingming and Badari are working on getting block reservation working for
ext3 (preallocation on steroids). That should fix ext3 up.


This patch:

- Later, we'll need to access the radix trees from inside disk I/O
completion handlers. So make mapping->page_lock irq-safe. And rename it
to tree_lock to reliably break any missed conversions.

--
greg

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Sullivan 2004-05-10 15:59:40 Re: signal 11 on AIX: 7.4.2
Previous Message Gregory Stark 2004-05-10 14:58:13 Linux 2.6.6 changes