Re: fallocate / posix_fallocate for new WAL file creation (etc...)

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fallocate / posix_fallocate for new WAL file creation (etc...)
Date: 2013-06-30 22:55:39
Message-ID: 51D0B76B.6040706@2ndQuadrant.com
Lists: pgsql-hackers

On 6/30/13 2:01 PM, Jeff Davis wrote:
> Simple test program attached, which creates two files and fills them:
> one by 2048 8KB writes; and another by 1 posix_fallocate of 16MB. Then,
> I just cmp the resulting files (and also "ls" them, to make sure they
> are 16MB).

This makes platform-level testing a lot easier, thanks. Attached is an
updated copy of that program with some error checking. If the files it
creates already existed, the code didn't notice, and a series of write
errors followed. If you set the test up right that's not a problem, but
it's better if a bad setup is caught. I wrapped the whole test in a
shell script, also attached, which ensures the right test sequence and
checks.
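
For readers without the attachments, here is a minimal sketch of the kind of test described above. It is not the attached fallocate.c; the function names, the use of O_EXCL to catch leftover files, and the byte-by-byte comparison standing in for cmp are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes posix_fallocate() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE   (16 * 1024 * 1024)  /* one 16MB file */
#define BLOCK_SIZE 8192                /* filled by 2048 writes of 8KB */

/* Open a file that must not exist yet; O_EXCL catches a stale test setup. */
int open_new(const char *path)
{
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0)
        perror(path);
    return fd;
}

/* Create the file with SEG_SIZE / BLOCK_SIZE zero-filled writes. */
int fill_by_writes(const char *path)
{
    char buf[BLOCK_SIZE];
    int i, fd = open_new(path);

    if (fd < 0)
        return -1;
    memset(buf, 0, sizeof(buf));
    for (i = 0; i < SEG_SIZE / BLOCK_SIZE; i++)
    {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        {
            perror("write");
            close(fd);
            return -1;
        }
    }
    return close(fd);
}

/* Create the file with a single posix_fallocate() call. */
int fill_by_fallocate(const char *path)
{
    int rc, fd = open_new(path);

    if (fd < 0)
        return -1;
    /* posix_fallocate() returns the error number; it does not set errno. */
    rc = posix_fallocate(fd, 0, SEG_SIZE);
    if (rc != 0)
    {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Like cmp: returns the common length if identical, -1 otherwise. */
long same_contents(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
    long n = 0;
    int ca, cb;

    if (fa == NULL || fb == NULL)
    {
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return -1;
    }
    do
    {
        ca = getc(fa);
        cb = getc(fb);
        if (ca != cb)
        {
            n = -1;     /* differing byte, or one file is shorter */
            break;
        }
        if (ca != EOF)
            n++;
    } while (ca != EOF);
    fclose(fa);
    fclose(fb);
    return n;
}
```

Calling fill_by_writes() and fill_by_fallocate() on two fresh paths and then checking that same_contents() returns SEG_SIZE covers both the cmp and the size check. Running either fill function a second time on the same path fails at open, which is the bad-setup case.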

Your C test program compiles and passes on RHEL5/6 here, but doesn't on
OS X Darwin. No surprise there: there's a long list of platforms that
don't support this call at
https://www.gnu.org/software/gnulib/manual/html_node/posix_005ffallocate.html
and the Mac is on it. Many other platforms I was worried about don't
support it either--older FreeBSD, HP-UX 11, Solaris 10, mingw, MSVC--so
that cuts down on testing quite a bit. If it runs faster on Linux,
that's the main target here, just like the existing
effective_io_concurrency fadvise code.

The specific thing I was worried about is that this interface might have
a stub that doesn't work perfectly in older Linux kernels. After being
surprised to find this interface worked on RHEL5 with your test program,
I dug into this more. It works there, but it may actually be slower.

posix_fallocate is actually implemented by glibc on Linux. Been there
since 2.1.94 according to the Linux man pages. But Linux itself didn't
add the feature until kernel 2.6.20: http://lwn.net/Articles/226436/
The biggest thing I was worried about--the call might be there in early
kernels but with a non-functional implementation--that's not the case.
Looking at the diff, before that patch there's no fallocate at all.

So what happened in earlier kernels, where there was no kernel level
fallocate available? According to
https://www.redhat.com/archives/fedora-devel-list/2009-April/msg00110.html
what glibc does is check for kernel fallocate(), and if it's not there
it writes a bunch of zeros to create the file instead. What is actually
happening on a RHEL5 system (with kernel 2.6.18) is that calling
posix_fallocate does this fallback behavior, where it basically does the
same thing the existing WAL clearing code does.

I can even prove that's the case. On RHEL5, if you run "strace -o out
./fallocate" the main write loop looks like this:

write(3,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192) = 8192

But when you call posix_fallocate, you still get a bunch of writes, just
a single byte at a time, one at the end of each 4096-byte block:

pwrite(4, "\0", 1, 16769023) = 1
pwrite(4, "\0", 1, 16773119) = 1
pwrite(4, "\0", 1, 16777215) = 1

That's glibc helpfully converting your call to posix_fallocate into
small writes, because the OS doesn't provide a better way in that
kernel. It's not hard to imagine this being slower than what the WAL
code is doing right now. I'm not worried about correctness issues
anymore, but my gut paranoia about this not working as expected on older
systems was justified. Everyone who thought I was just whining owes me
a cookie.
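
The write pattern in that trace can be reproduced with a short loop. This is only a sketch of what the strace output shows, not glibc's actual source: the function names are hypothetical, the 4096-byte block size is taken from the spacing of the offsets above, and unlike a real posix_fallocate it overwrites bytes, so it is only safe on a newly created, empty file.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes pwrite() */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one zero byte at the last offset of every blksz-sized block,
 * which forces the filesystem to allocate each block.
 * Returns 0 on success, -1 on error.
 */
int touch_each_block(int fd, off_t len, off_t blksz)
{
    static const char zero = 0;
    off_t off;

    if (len <= 0 || blksz <= 0)
        return -1;
    for (off = blksz - 1; off < len; off += blksz)
    {
        if (pwrite(fd, &zero, 1, off) != 1)
            return -1;
    }
    /* Cover a trailing partial block so the file reaches len bytes. */
    if (len % blksz != 0 && pwrite(fd, &zero, 1, len - 1) != 1)
        return -1;
    return 0;
}

/*
 * Create (or truncate) path, touch every 4096-byte block, and return
 * the resulting file size, or -1 on error.
 */
off_t create_touched_file(const char *path, off_t len)
{
    off_t size;
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);

    if (fd < 0)
        return -1;
    if (touch_each_block(fd, len, 4096) != 0)
    {
        close(fd);
        return -1;
    }
    size = lseek(fd, 0, SEEK_END);
    close(fd);
    return size;
}
```

For a 16MB file this issues 4096 one-byte pwrite calls, the last at offset 16777215, compared with 2048 calls of 8192 bytes for the plain write loop.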

This is what I plan to benchmark specifically next. If the
posix_fallocate approach is actually slower than what's done now when
it's not getting kernel acceleration, which is the case on RHEL5 era
kernels, we might need to make the configure time test more complicated.
Whether posix_fallocate is defined isn't a sensitive enough test; on Linux
it may be the case that this is only usable when fallocate() is also there.
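
One way such a stricter check could look is a small probe that calls the Linux-specific fallocate() directly and inspects the error. This is a sketch under assumptions, not a proposed configure test: the function name is hypothetical, it needs a glibc that exposes the fallocate() wrapper, and treating ENOSYS and EOPNOTSUPP as "no kernel support" is an interpretation of the Linux man page rather than anything stated in this thread.

```c
#define _GNU_SOURCE               /* exposes the Linux fallocate() wrapper */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Probe whether the kernel and the filesystem holding 'path' support
 * fallocate() natively.  'path' must not exist; it is removed afterwards.
 * Returns 1 if supported, 0 if not supported, -1 on any other error.
 */
int kernel_fallocate_works(const char *path)
{
    int fd, rc, saved_errno;

    fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    /* mode 0 asks for plain allocation of the first 8192 bytes */
    rc = fallocate(fd, 0, 0, 8192);
    saved_errno = errno;

    close(fd);
    unlink(path);

    if (rc == 0)
        return 1;
    if (saved_errno == ENOSYS || saved_errno == EOPNOTSUPP)
        return 0;
    return -1;
}
```

A result of 0 would be the case where posix_fallocate presumably falls back to small writes. Because the answer can differ per filesystem, a check like this may fit better at run time, against the directory that will hold the files, than at configure time.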

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachment      Content-Type   Size
testfallocate   text/plain     536 bytes
fallocate.c     text/x-csrc    544 bytes
