On Tuesday, May 01, 2012 04:08:27 PM Robert Haas wrote:
> We've previously discussed the possible desirability of extending
> relations in larger increments, rather than one block at a time, for
> performance reasons. I attempted to determine how much performance we
> could possibly buy this way, and found that, as far as I can see, the
> answer is, basically, none. I wrote a test program which does writes
> until it reaches 1GB, and times how long the writes take in aggregate.
> Then it performs a single fdatasync at the end and times that as
> well. On some of the machines it is slightly faster in the aggregate
> to extend in larger chunks, but the magnitude of the change is little
> enough that, at least to me, it seems entirely not worth bothering
> with. Some results are below. Now, one thing that this test doesn't
> help much with is the theory that it's better to extend a file in
> larger chunks because the file will become less fragmented on disk. I
> don't really know how to factor that effect into the test - any ideas?
I think that test disregards the fact that we're holding an exclusive lock
during the file extension, which happens rather frequently if you have multiple
COPYs or similar running at the same time. Extending in bigger chunks reduces
the frequency of taking that lock.
I really would *love* to see improvements in that kind of workload.
> I also considered two other methods of extending a file. First, there
> is ftruncate(). It's really fast. Unfortunately, it's unsuitable for
> our purposes because it will cheerfully leave holes in the file, and
> part of the reason for our current implementation is to make sure that
> there are no holes, so that later writes to the file can't fail for
> lack of disk space. So that's no good. Second, and more
> interestingly, there is a function called posix_fallocate(). It is
> present on Linux but not on MacOS X; I haven't checked any other
> platforms. It claims that it will extend a file out to a particular
> size, forcing disk blocks to be allocated so that later writes won't
> fail. Testing (more details below) shows that posix_fallocate() is
> quite efficient for large chunks. For example, extending a file to
> 1GB in size 64 blocks at a time (that is, 256kB at a time) took only
> ~60 ms and the subsequent fdatasync took almost no time at all,
> whereas zero-filling the file out 1GB using write() took 600-700 ms
> and the subsequent fdatasync took another 4-5 seconds. That seems
> like a pretty sizable win, and it's not too hard to imagine that it
> could be even better when the I/O subsystem is busy. Unfortunately,
> using posix_fallocate() for 8kB chunks seems to be significantly less
> efficient than our current method - I'm guessing that it actually
> writes the updated metadata back to disk, where write() does not (this
> makes one wonder how safe it is to count on write to have the behavior
> we need here in the first place).
Currently the write() doesn't need to be crash-safe, because it will be repeated
during crash recovery and a checkpoint will fsync the file.
I don't really see why posix_fallocate() would need to be competitive in the 8kB
case. What reason would there be to keep extending in such small increments?
There is the question of whether this should be done in the background, though,
so that the relation extension lock is never taken in anything time-critical...