From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 21:53:41
Message-ID: 438241E5.2010701@rentec.com
Lists: pgsql-performance
Luke,

it's time to back yourself up with some numbers.  You're claiming that 
portions of postgresql need a significant rewrite, and you haven't done 
the work to make that case.

You've apparently made some mistakes in your use of dd to benchmark a 
storage system.  Use lmdd, umount the file system before the read, and 
post your results.  Using a file 2x the size of memory doesn't work 
correctly.  You can quote any other numbers you want, but until you use 
lmdd correctly you should be ignored.  Ideally, since postgresql uses 
1GB files, you'll want to use 1GB files for dd as well.
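
Something like this is what I have in mind; the mount point and file 
name are placeholders, and 128k 8KB blocks is a 1GB file:

$ lmdd if=internal of=/fs/bigfile bs=8k count=128k sync=1
$ umount /fs && mount /fs     # drop the cached pages so the read hits the disks
$ lmdd if=/fs/bigfile of=internal bs=8k count=128k

The sync=1 makes the write number include getting the data to disk, and 
the umount/mount makes the read number honest.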

Luke Lonergan wrote:
> Alan,
>
> On 11/21/05 6:57 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
>
>   
>> $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
>> 800000+0 records in
>> 800000+0 records out
>>
>> real    0m13.780s
>> user    0m0.134s
>> sys     0m13.510s
>>
>> Oops.   I just wrote 470MB/s to a file system that has a peak write
>> speed of 200MB/s.
>>     
> How much RAM on this machine?
>   
Doesn't matter.  The result will always be wrong without a call to 
sync() or fsync() before the close() if you're trying to measure the 
speed of the disk subsystem.   Add that sync() and the result will be 
correct for any memory size.  Just for completeness:  Solaris implicitly 
calls sync() as part of close().   Bonnie used to get this wrong, so 
quoting Bonnie isn't any good.   Note that on some systems using 2x 
memory for these tests is almost OK.  For example, Solaris used to have 
a high-water mark that would throttle processes and not allow more than 
a few hundred KB of writes to be outstanding on a file.  Linux/XFS 
clearly allows a lot of write data to be outstanding.  It's better to 
understand the tools, what they do, and why they can be wrong than to 
simply quote some other tool that makes the same mistakes.
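
You can see the effect for yourself by timing the flush along with the 
write.  A sketch (bash; the path is a placeholder, and 131072 8KB 
blocks is 1GB):

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=131072
$ time ( dd if=/dev/zero of=/fidb1/bigfile bs=8k count=131072 && sync )

Only the second number, which includes the sync, says anything about 
the disk subsystem.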

I find that postgresql is able to achieve about 175MB/s on average from 
a system capable of delivering 200MB/s peak and it does this with a lot 
of cpu time to spare.   Maybe dd can do a little better and deliver 
185MB/s.    If I were to double the speed of my IO system, I might find 
that a single postgresql instance can sink about 300MB/s of data (based 
on the last numbers I posted).  That's why I have multi-cpu opterons and 
more than one query/client as they soak up the remaining IO capacity.

It is guaranteed that postgresql will hit some performance threshold in 
the future, and rewrites of some core functionality may be needed, but 
no numbers posted here so far have made the case that postgresql is in 
trouble now.     In the meantime, build balanced systems with cpus that 
match the capabilities of the storage subsystems, use 32KB block sizes 
for large memory databases that are doing lots of sequential scans, use 
file systems tuned for large files, use opterons, etc.
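
For what it's worth, the postgresql block size is a compile-time 
constant, so 32KB pages mean a rebuild and a fresh initdb.  A sketch, 
assuming a source tree where BLCKSZ is defined in 
src/include/pg_config_manual.h (check your version; the data directory 
path is a placeholder):

$ sed -i 's/#define BLCKSZ.*/#define BLCKSZ 32768/' src/include/pg_config_manual.h
$ ./configure && make && make install
$ initdb -D /path/to/data     # required: the on-disk page size changed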


As always, one has to post some numbers.   Here's an example of how dd 
doesn't do what you might expect:

mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k
16.7772 MB in 0.0235 secs, 714.5931 MB/sec

mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k sync=1
16.7772 MB in 0.1410 secs, 118.9696 MB/sec

Both numbers are "correct".  But one measures the kernel's ability to 
absorb 2048 8KB writes with no guarantee that the data is on disk, and 
the second measures the disk subsystem's ability to write 16MB of data.  
dd is equivalent to the first result.  You can't use the first kind of 
result and complain that postgresql is slow.  If you wrote 16G of data 
on a machine with 8G of memory, then your dd result is possibly too 
fast by a factor of two, as 8G of the data might not be on disk yet.  
We won't know until you post some results.
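
If you want to see how big that effect is on your own machine, time the 
flush separately after a large write.  A sketch (the path is a 
placeholder; 2097152 8KB blocks is 16GB):

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=2097152
$ time sync     # flushes whatever dd left in the page cache

If the sync takes a large fraction of the dd time, the dd number alone 
badly overstates the disk rate.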

Cheers,

-- Alan

