Re: Block at a time ...

From: Scott Carey <scott(at)richrelevance(dot)com>
To: Craig James <craig_james(at)emolecules(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Block at a time ...
Date: 2010-03-27 00:19:12
Message-ID: 302806BE-C518-467F-B627-BA8D29B02297@richrelevance.com
Lists: pgsql-performance


On Mar 22, 2010, at 4:46 PM, Craig James wrote:

> On 3/22/10 11:47 AM, Scott Carey wrote:
>>
>> On Mar 17, 2010, at 9:41 AM, Craig James wrote:
>>
>>> On 3/17/10 2:52 AM, Greg Stark wrote:
>>>> On Wed, Mar 17, 2010 at 7:32 AM, Pierre C<lists(at)peufeu(dot)com> wrote:
>>>>>> I was thinking in something like that, except that the factor I'd use
>>>>>> would be something like 50% or 100% of current size, capped at (say) 1 GB.
>>>>
>>>> This turns out to be a bad idea. One of the first things Oracle DBAs
>>>> are told to do is change this default setting to allocate some
>>>> reasonably large fixed size rather than scaling upwards.
>>>>
>>>> This might be mostly due to Oracle's extent-based space management but
>>>> I'm not so sure. Recall that the filesystem is probably doing some
>>>> rounding itself. If you allocate 120kB it's probably allocating 128kB
>>>> itself anyways. Having two layers rounding up will result in odd
>>>> behaviour.
>>>>
>>>> In any case I was planning on doing this a while back. Then I ran some
>>>> experiments and couldn't actually demonstrate any problem. ext2 seems
>>>> to do a perfectly reasonable job of avoiding this problem. All the
>>>> files were mostly large contiguous blocks after running some tests --
>>>> IIRC running pgbench.
>>>
>>> This is one of the more-or-less solved problems in Unix/Linux. Ext* file systems have a "reserve" usually of 10% of the disk space that nobody except root can use. It's not for root, it's because with 10% of the disk free, you can almost always do a decent job of allocating contiguous blocks and get good performance. Unless Postgres has some weird problem that Linux has never seen before (and that wouldn't be unprecedented...), there's probably no need to fool with file-allocation strategies.
>>>
>>> Craig
>>>
>>
>> It's fairly easy to break. Just do a parallel import with, say, 16 concurrent tables being written to at once. Result? Fragmented tables.
>
> Is this from real-life experience? With fragmentation, there's a point of diminishing return. A couple head-seeks now and then hardly matter. My recollection is that even when there are lots of concurrent processes running that are all making files larger and larger, the Linux file system still can do a pretty good job of allocating mostly-contiguous space. It doesn't just dumbly allocate from some list, but rather tries to allocate in a way that results in pretty good "contiguousness" (if that's a word).
>
> On the other hand, this is just from reading discussion groups like this one over the last few decades, I haven't tried it...
>

Well, how fragmented is too fragmented depends on the use case and the hardware capability. In real-world use, which for me means about 20 phases of large bulk inserts a day and not a lot of updates or index maintenance, the system gets somewhat fragmented, but it's not too bad. I did a dump/restore in 8.4 with parallel restore and it was much slower than usual. I did a single-threaded restore and it was much faster. The dev environments are on ext3 and we see this pretty clearly, but poor OS tuning can mask it (readahead parameter not set high enough). This is CentOS 5.4/5.3; perhaps later kernels are better at scheduling file writes to avoid this. We also use the deadline scheduler, which helps a lot on concurrent reads but might be messing up concurrent writes.
On production with xfs this was also bad at first, in fact worse, because xfs's default 'allocsize' setting is 64k. So files were regularly fragmented in small multiples of 64k. Changing the 'allocsize' parameter to 80MB made the restore process produce files with fragment sizes of 80MB. 80MB is big for most systems, but this array does over 1000MB/sec sequential read at peak, and only 200MB/sec with moderate fragmentation.
Setting 'allocsize' that high won't make disk space allocation fail due to any 'reservations' from delayed allocation; it just means that xfs won't choose to create a new file or extent within 80MB of another file that is open unless it has to. This can cause performance problems if you have lots of small files, which is why the default is 64k.
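
To make the fragment-size argument concrete, here is a rough toy model in Python. It is only a sketch under stated assumptions, not a measurement and not how ext3 or xfs actually allocate: the round-robin allocator, the function names, and the 5 ms seek cost are all made up for illustration, and it ignores RAID striping, readahead and caching. It shows two things: concurrent writers that grab space in fixed-size chunks end up with files whose fragments are one chunk long, and per-fragment seek cost dominates throughput when fragments are small.

```python
def interleaved_extents(writers, chunks_per_file, chunk_kb):
    """Toy allocator: writers take turns grabbing the next free chunk.

    Returns {writer: [(start_kb, length_kb), ...]}, merging chunks that
    happen to be adjacent on disk into a single extent.
    """
    extents = {w: [] for w in range(writers)}
    next_free = 0
    for _ in range(chunks_per_file):
        for w in range(writers):
            runs = extents[w]
            if runs and runs[-1][0] + runs[-1][1] == next_free:
                # Adjacent to this file's previous chunk: extend the extent.
                runs[-1] = (runs[-1][0], runs[-1][1] + chunk_kb)
            else:
                # Another writer got in between: start a new fragment.
                runs.append((next_free, chunk_kb))
            next_free += chunk_kb
    return extents


def effective_mb_per_sec(fragment_mb, seq_mb_per_sec, seek_ms):
    """Throughput if every fragment costs one seek plus a sequential read."""
    transfer_s = fragment_mb / seq_mb_per_sec
    return fragment_mb / (transfer_s + seek_ms / 1000.0)


if __name__ == "__main__":
    # One writer vs. 16 concurrent writers, both allocating 64 kB at a time.
    single = interleaved_extents(writers=1, chunks_per_file=100, chunk_kb=64)
    parallel = interleaved_extents(writers=16, chunks_per_file=100, chunk_kb=64)
    print("extents per file, 1 writer:  ", len(single[0]))
    print("extents per file, 16 writers:", len(parallel[0]))

    # Assumed numbers: 1000 MB/sec sequential rate, 5 ms per seek.
    for fragment_mb in (0.0625, 1.0, 80.0):
        rate = effective_mb_per_sec(fragment_mb, 1000.0, 5.0)
        print(f"{fragment_mb:>8} MB fragments: {rate:.1f} MB/sec")
```

In this model a bigger allocation chunk only matters when writers are interleaved, which matches what I saw: the single-threaded restore was fine with the defaults, and the parallel one needed the larger 'allocsize'. Don't read the absolute numbers as predictions for any real array.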

> Craig
