On Friday 23 March 2007 14:32, Matt Smiley wrote:
> Thanks Dimitri! That was very educational material! I'm going to think
> out loud here, so please correct me if you see any errors.
Your mail was so long that I was unable to answer all the questions the same day :))
> The section on tuning for OLTP transactions was interesting, although my
> OLAP workload will be predominantly bulk I/O over large datasets of
> mostly-sequential blocks.
I suppose mostly READ operations, right?
> The NFS+ZFS section talked about the zil_disable control for making zfs
> ignore commits/fsyncs. Given that Postgres' executor does single-threaded
> synchronous I/O like the tar example, it seems like it might benefit
> significantly from setting zil_disable=1, at least in the case of
> frequently flushed/committed writes. However, zil_disable=1 sounds unsafe
> for the datafiles' filesystem, and would probably only be acceptable for
> the xlogs if they're stored on a separate filesystem and you're willing to
> lose recently committed transactions. This sounds pretty similar to just
> setting fsync=off in postgresql.conf, which is easier to change later, so
> I'll skip the zil_disable control.
Yes, you don't need it for PostgreSQL. It may be useful for other database
vendors, but not here.
> The RAID-Z section was a little surprising. It made RAID-Z sound just like
> RAID 50, in that you can customize the trade-off between iops versus usable
> diskspace and fault-tolerance by adjusting the number/size of
> parity-protected disk groups. The only difference I noticed was that
> RAID-Z will apparently set the stripe size across vdevs (RAID-5s) to be as
> close as possible to the filesystem's block size, to maximize the number of
> disks involved in concurrently fetching each block. Does that sound about
Well, look at RAID-Z as just a wide RAID solution. The more disks you have in
your system, the higher the probability that you lose 2 disks at the same time,
and in this case a wide RAID-10 will simply make you lose the whole data set
(if the two disks you lose are in the same mirror pair). So RAID-Z brings you
more security, as you may use wider parity, but the price for it is I/O
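To put a rough number on the two-disk-failure risk, here is a minimal Python sketch. It is only an illustration under stated assumptions: exactly two disks fail at the same time, every disk is equally likely to fail, and the function and parameter names are hypothetical, not part of any ZFS tool.

```python
from fractions import Fraction


def p_loss_after_two_failures(n_disks: int, group_size: int, tolerated: int) -> Fraction:
    """Chance that two simultaneous, uniformly random disk failures lose the pool.

    n_disks    -- total disks in the pool (illustrative parameter)
    group_size -- disks per redundancy group (2 for a mirror pair)
    tolerated  -- failures one group survives (1 for a 2-way mirror or
                  single-parity RAID-Z, 2 for RAID-Z2)
    """
    if group_size < 2 or n_disks % group_size != 0:
        raise ValueError("disks must divide evenly into groups of at least 2")
    if tolerated < 1:
        raise ValueError("this sketch only covers redundant groups")
    if tolerated >= 2:
        # A group that survives two failures cannot be lost to two failures.
        return Fraction(0)
    # Data is lost only if the second failure hits the same group as the
    # first: group_size - 1 such disks among the n_disks - 1 still running.
    return Fraction(group_size - 1, n_disks - 1)
```

For a hypothetical 46-disk pool, 23 mirror pairs give a 1/45 chance that a double failure is fatal, a single 46-wide single-parity group gives 1, and any double-parity layout gives 0.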
> So now I'm wondering what RAID-Z offers that RAID-50 doesn't. I came up
> with 2 things: an alleged affinity for full-stripe writes and (under
> RAID-Z2) the added fault-tolerance of RAID-6's 2nd parity bit (allowing 2
> disks to fail per zpool). It wasn't mentioned in this blog, but I've heard
> that under certain circumstances, RAID-Z will magically decide to mirror a
> block instead of calculating parity on it. I'm not sure how this would
> happen, and I don't know the circumstances that would trigger this
> behavior, but I think the goal (if it really happens) is to avoid the
> performance penalty of having to read the rest of the stripe required to
> calculate parity. As far as I know, this is only an issue affecting small
> writes (e.g. single-row updates in an OLTP workload), but not large writes
> (compared to the RAID's stripe size). Anyway, when I saw the filesystem's
> intent log mentioned, I thought maybe the small writes are converted to
> full-stripe writes by deferring their commit until a full stripe's worth of
> data had been accumulated. Does that sound plausible?
The problem here is that within the same workload you're able to do fewer I/O
operations with RAID-Z than with RAID-10. So whether your I/O block size is
bigger or smaller, you'll still obtain lower throughput, no? :)
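A back-of-envelope way to see the gap, as a Python sketch. It relies on a common rule of thumb rather than on measurements: for small random reads every disk in a mirrored pool can serve a different request, while a RAID-Z group spreads each block over all its disks and so behaves roughly like one disk. The function names and the numbers in the usage note are assumptions for illustration only.

```python
def mirror_pool_iops(n_disks: int, per_disk_iops: int) -> int:
    """Rough random-read IOPS for a pool striped over 2-way mirrors.

    Either half of a mirror can serve a read, so all disks contribute.
    """
    return n_disks * per_disk_iops


def raidz_pool_iops(n_disks: int, group_size: int, per_disk_iops: int) -> int:
    """Rough random-read IOPS for a pool striped over RAID-Z groups.

    Rule-of-thumb assumption: one block read touches every data disk in
    its group, so each group delivers about one disk's worth of IOPS.
    """
    if group_size < 2 or n_disks % group_size != 0:
        raise ValueError("disks must divide evenly into groups of at least 2")
    n_groups = n_disks // group_size
    return n_groups * per_disk_iops
```

With 40 disks at an assumed 100 random reads per second each, this model gives about 4000 IOPS for mirrors and about 800 IOPS for eight 5-disk RAID-Z groups. Real results depend on caching, block size and workload.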
> Are there any other noteworthy perks to RAID-Z, rather than RAID-50? If
> not, I'm inclined to go with your suggestion, Dimitri, and use zfs like
> RAID-10 to stripe a zpool over a bunch of RAID-1 vdevs. Even though many
> of our queries do mostly sequential I/O, getting higher seeks/second is
> more important to us than the sacrificed diskspace.
There is still one point to check: if you do mostly READs on your database,
RAID-Z will probably be not *too* bad and will give you more usable space.
However, if you need to update or load your data frequently, RAID-10 will be
> For the record, those blogs also included a link to a very helpful ZFS Best
> Practices Guide:
Oh yes, it's a constantly growing wiki, a good start for any Solaris questions
as well as performance points :)
> To sum up, so far the short list of tuning suggestions for ZFS includes:
> - Use a separate zpool and filesystem for xlogs if your apps write often.
> - Consider setting zil_disable=1 on the xlogs' dedicated filesystem. ZIL
>   is the intent log, and it sounds like disabling it may be like disabling
>   journaling. Previous message threads in the Postgres archives debate
>   whether this is safe for the xlogs, but it didn't seem like a conclusive
>   answer was reached.
> - Make filesystem block size (zfs record size) match the Postgres block
>   size.
> - Manually adjust vdev_cache. I think this sets the read-ahead size. It
>   defaults to 64 KB. For OLTP workload, reduce it; for DW/OLAP maybe
>   increase it.
> - Test various settings for vq_max_pending (until zfs can auto-tune it).
>   See http://blogs.sun.com/erickustarz/entry/vq_max_pending
> - A zpool of mirrored disks should support more seeks/second than RAID-Z,
>   just like RAID 10 vs. RAID 50. However, no single Postgres backend will
>   see better than a single disk's seek rate, because the executor
>   currently dispatches only 1 logical I/O request at a time.
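The last point can be sketched as a toy model in Python. It assumes the best case, where every outstanding request lands on a different disk, and it ignores caching and read-ahead. The function name and parameters are hypothetical.

```python
def effective_iops(outstanding_requests: int, n_disks: int, per_disk_iops: int) -> int:
    """Rough upper bound on random I/O per second seen by the database.

    outstanding_requests -- I/O requests in flight at once; one backend that
                            issues a single synchronous request at a time
                            counts as 1
    n_disks              -- disks able to serve requests independently
    per_disk_iops        -- assumed random I/O rate of one disk
    """
    if min(outstanding_requests, n_disks, per_disk_iops) < 0:
        raise ValueError("arguments must be non-negative")
    # A disk only helps if there is a request to give it, and a request
    # only helps if there is an idle disk to serve it.
    busy_disks = min(outstanding_requests, n_disks)
    return busy_disks * per_disk_iops
```

Under this model a single backend stays at one disk's rate no matter how many disks the pool has, while a multi-user workload scales with the number of concurrent backends until all disks are busy.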
I'm currently running an OLTP benchmark on ZFS and, quite surprisingly, it's
really *doing* several concurrent I/O operations on a multi-user workload! :)
Even vacuum seems to run much faster (or probably it's just my
But keep in mind: ZFS is a very young file system and is only taking its first
steps with database workloads. So the current goal here is to bring ZFS
performance at least to the level that UFS reaches under the same conditions...
Positive news: PostgreSQL seems to me to perform much better than other
database vendors (currently I'm getting at least 80% of UFS performance)...
All the tuning points you mentioned are correct, and I promise to publish all
other details/findings once I've finished my tests! (It's too early to draw
conclusions yet :))
> >>> Dimitri <dimitrik(dot)fr(at)gmail(dot)com> 03/23/07 2:28 AM >>>
> On Friday 23 March 2007 03:20, Matt Smiley wrote:
> > My company is purchasing a Sunfire x4500 to run our most I/O-bound
> > databases, and I'd like to get some advice on configuration and tuning.
> > We're currently looking at:
> > - Solaris 10 + zfs + RAID Z
> > - CentOS 4 + xfs + RAID 10
> > - CentOS 4 + ext3 + RAID 10
> > but we're open to other suggestions.
> For Solaris + ZFS you may find answers to all your questions here:
> Remember to measure log (WAL) activity and use a separate pool for logs if
> needed. Also, RAID-Z is more security-oriented than performance-oriented;
> RAID-10 should be a better choice...