Re: fallocate / posix_fallocate for new WAL file creation (etc...)

From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-07-01 01:28:11
Message-ID: CAKuK5J1AcML-1cGBhnRzED-vh4oG+8HkmFhy2ECa-8JBJ-6qbQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jun 30, 2013 at 6:49 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 5/28/13 10:00 PM, Jon Nelson wrote:
>
>> A note: The attached test program uses *fsync* instead of *fdatasync*
>> after calling fallocate (or writing out 16MB of zeroes), per an
>> earlier suggestion.
>
>
> I tried this out on the RHEL5 platform I'm worried about now. There's
> something weird about the test program there. If I run it once it shows
> posix_fallocate running much faster:
>
> without posix_fallocate: 1 open/close iterations, 1 rewrite in 23.0169s
> with posix_fallocate: 1 open/close iterations, 1 rewrite in 11.1904s

Assuming the platform chosen is using the glibc approach of pwrite(4
bytes) every 4KiB, then the results ought to be similar, and I'm at a
loss to explain why it's performing better (unless - grasping at
straws - simply the *volume* of data transferred from userspace to the
kernel is at play, in which case posix_fallocate will result in 4096
calls to pwrite but at 4 bytes each versus 2048 calls to write at 8KiB
each.) Ultimately the same amount of data gets written to disk (one
would imagine), but otherwise I can't really think of much.
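
For readers without the attachment, here is a minimal sketch of the two
allocation strategies being compared. It is not the attached
test_fallocate.c; the function names (allocate_classic,
allocate_fallocate, file_size) are made up for illustration, and it
assumes a Linux/glibc system where posix_fallocate is available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SIXTEENMB (1024*1024*16)
#define EIGHTKB (1024*8)

/* "Classic" method: write 2048 zero-filled 8KiB blocks, then fsync.
 * Returns 0 on success, -1 on error. (Hypothetical helper name.) */
int allocate_classic(int fd)
{
    char buf[EIGHTKB];

    assert(SIXTEENMB % EIGHTKB == 0);   /* block size must divide file size */
    memset(buf, 0, sizeof buf);
    for (int i = 0; i < SIXTEENMB / EIGHTKB; i++)
        if (write(fd, buf, sizeof buf) != (ssize_t) sizeof buf)
            return -1;
    return fsync(fd);
}

/* posix_fallocate method: ask the C library/kernel to reserve 16MiB,
 * then fsync. Returns 0 on success, -1 on error. */
int allocate_fallocate(int fd)
{
    /* posix_fallocate returns an error number directly, not -1/errno. */
    if (posix_fallocate(fd, 0, SIXTEENMB) != 0)
        return -1;
    return fsync(fd);
}

/* Size of the file behind fd in bytes, or -1 on error. */
off_t file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}
```

Both functions should leave a 16MiB file behind; only the number and
size of the system calls issued from userspace differ.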

I have also found several errors in the test_fallocate.c program I
posted, corrected below.
One of them is that two #defines are missing parentheses:

#define SIXTEENMB 1024*1024*16
#define EIGHTKB 1024*8

should be:

#define SIXTEENMB (1024*1024*16)
#define EIGHTKB (1024*8)

Otherwise the program ends up writing 131072 8KiB blocks instead of 2048.
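
The arithmetic can be checked with a small sketch. It assumes the block
count is computed as SIXTEENMB/EIGHTKB (an assumption about the loop
bound, though it is consistent with the 131072 figure); the _BAD macro
names and the function names are made up for illustration.

```c
#include <assert.h>

/* Unparenthesized, as in the program as first posted. */
#define SIXTEENMB_BAD 1024*1024*16
#define EIGHTKB_BAD 1024*8

/* Parenthesized, as in the corrected program. */
#define SIXTEENMB (1024*1024*16)
#define EIGHTKB (1024*8)

/* Expands to 1024*1024*16/1024*8, which C evaluates left to right:
 * ((1024*1024*16)/1024)*8. */
long blocks_without_parens(void)
{
    return SIXTEENMB_BAD / EIGHTKB_BAD;
}

/* Expands to (1024*1024*16)/(1024*8). */
long blocks_with_parens(void)
{
    return SIXTEENMB / EIGHTKB;
}

/* Sanity check of both block counts. */
void check_block_counts(void)
{
    assert(blocks_without_parens() == 131072);
    assert(blocks_with_parens() == 2048);
}
```

Without the parentheses the division binds only to the leading 1024 of
the second macro, so the trailing *8 multiplies the quotient instead of
the divisor.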

This actually makes the comparison between writing 8KiB blocks and
using posix_fallocate favor the latter more strongly in the results
(also seen below).

> The problem is that I'm seeing the gap between the two get smaller the more
> iterations I run, which makes me wonder if the test is completely fair:
>
> without posix_fallocate: 2 open/close iterations, 2 rewrite in 34.3281s
> with posix_fallocate: 2 open/close iterations, 2 rewrite in 23.1798s
>
>
> without posix_fallocate: 3 open/close iterations, 3 rewrite in 44.4791s
> with posix_fallocate: 3 open/close iterations, 3 rewrite in 33.6102s
>
> without posix_fallocate: 5 open/close iterations, 5 rewrite in 65.6244s
> with posix_fallocate: 5 open/close iterations, 5 rewrite in 61.0991s
>
> You didn't show any output from the latest program on your system, so I'm
> not sure how it behaved for you here.

On the platform I use (openSUSE 12.3, x86_64, currently kernel 3.9.7)
I never see posix_fallocate perform worse. Typically it performs
better, sometimes much better.

To set the number of times the file is overwritten to just 1 (one):

for i in 1 2 5 10 100; do ./test_fallocate foo $i 1; done

I am including a revised version of test_fallocate.c. It corrects the
error noted above, fixes one typo (from when I changed fdatasync to
fsync) that did not alter program behavior, and corrects a misplaced
gettimeofday (which does change the results). It also adds a new test
that aims (perhaps poorly) to emulate the glibc style of pwrite(4
bytes) for every 4KiB, and checks that the resulting file is 16MiB in
size.
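
A rough sketch of what such an emulation and size check might look like
follows. It is not the code from the attachment: the function names are
hypothetical, and writing the 4 bytes at the end of each 4KiB block (so
that the file ends at exactly 16MiB) is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SIXTEENMB (1024*1024*16)
#define FOURKB (1024*4)

/* Emulate the glibc-style fallback: one small pwrite per 4KiB block,
 * 4096 calls in total for a 16MiB file, then fsync.
 * Returns 0 on success, -1 on error. (Hypothetical helper name.) */
int allocate_glibc_style(int fd)
{
    static const char zeros[4] = {0};

    assert(SIXTEENMB % FOURKB == 0);
    for (off_t off = 0; off < SIXTEENMB; off += FOURKB) {
        /* Assumption: touch the last 4 bytes of each block. */
        off_t pos = off + FOURKB - (off_t) sizeof zeros;

        if (pwrite(fd, zeros, sizeof zeros, pos) != (ssize_t) sizeof zeros)
            return -1;
    }
    return fsync(fd);
}

/* Returns 1 if the file behind fd is exactly 16MiB, 0 otherwise. */
int size_is_16mb(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return 0;
    return st.st_size == SIXTEENMB;
}
```

Checking the size afterwards guards against an off-by-one in the offsets,
which would otherwise silently produce a file of the wrong length.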

The new test sometimes performs better and sometimes worse (usually
worse) than either of the other two. In all cases, posix_fallocate
performs better, but I don't have a sufficiently old kernel to test with.

The new results on one machine are below.

With 0 (zero) rewrites (testing *just*
open/some_type_of_allocation/fsync/close):

method: classic. 100 open/close iterations, 0 rewrite in 29.6060s
method: posix_fallocate. 100 open/close iterations, 0 rewrite in 2.1054s
method: glibc emulation. 100 open/close iterations, 0 rewrite in 31.7445s
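
Since the placement of gettimeofday affects the numbers, here is a
sketch of how the zero-rewrite case could be timed, with the timestamps
taken around the whole open/allocate/fsync/close loop. This is an
illustration under assumptions, not the attached program: the function
names are hypothetical, and unlinking the file after each cycle is a
guess at how each iteration starts from a fresh file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/time.h>
#include <unistd.h>

#define SIXTEENMB (1024*1024*16)

/* Difference between two gettimeofday() samples, in seconds. */
double elapsed_seconds(struct timeval start, struct timeval end)
{
    return (double) (end.tv_sec - start.tv_sec)
         + (double) (end.tv_usec - start.tv_usec) / 1e6;
}

/* Time `iterations` open/posix_fallocate/fsync/close cycles on `path`.
 * Returns elapsed seconds, or -1.0 on error. (Hypothetical helper.) */
double time_fallocate_cycles(const char *path, int iterations)
{
    struct timeval start, end;

    assert(iterations >= 0);
    gettimeofday(&start, NULL);          /* start before the first open */
    for (int i = 0; i < iterations; i++) {
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
            return -1.0;
        int ok = posix_fallocate(fd, 0, SIXTEENMB) == 0 && fsync(fd) == 0;
        int closed = close(fd) == 0;

        unlink(path);                    /* assumption: fresh file per cycle */
        if (!ok || !closed)
            return -1.0;
    }
    gettimeofday(&end, NULL);            /* stop after the last close */
    return elapsed_seconds(start, end);
}
```

For example, time_fallocate_cycles("foo", 100) would correspond roughly
to the posix_fallocate line above.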

And with the same number of rewrites as open/close cycles:

method: classic. 1 open/close iterations, 1 rewrite in 0.6297s
method: posix_fallocate. 1 open/close iterations, 1 rewrite in 0.3028s
method: glibc emulation. 1 open/close iterations, 1 rewrite in 0.5521s

method: classic. 2 open/close iterations, 2 rewrite in 1.6455s
method: posix_fallocate. 2 open/close iterations, 2 rewrite in 1.0409s
method: glibc emulation. 2 open/close iterations, 2 rewrite in 1.5604s

method: classic. 5 open/close iterations, 5 rewrite in 7.5916s
method: posix_fallocate. 5 open/close iterations, 5 rewrite in 6.9177s
method: glibc emulation. 5 open/close iterations, 5 rewrite in 8.1137s

method: classic. 10 open/close iterations, 10 rewrite in 29.2816s
method: posix_fallocate. 10 open/close iterations, 10 rewrite in 28.4400s
method: glibc emulation. 10 open/close iterations, 10 rewrite in 31.2693s

--
Jon

Attachment: test_fallocate.c (text/x-csrc, 3.5 KB)