Re: Hardware/OS recommendations for large databases (

From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-20 02:43:48
Message-ID: 437FE2E4.5040600@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Luke Lonergan wrote:
> Alan,
>
> On 11/18/05 11:39 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:
>
>
>> Yes and no. The one cpu is clearly idle. The second cpu is 40% busy
>> and 60% idle (aka iowait in the above numbers).
>>
>
> The "aka iowait" is the problem here - iowait is not idle (otherwise it
> would be in the "idle" column).
>
> Iowait is time spent waiting on blocking io calls. As another poster
> pointed out, you have a two CPU system, and during your scan, as predicted,
> one CPU went 100% busy on the seq scan. During iowait periods, the CPU can
> be context switched to other users, but as I pointed out earlier, that's not
> useful for getting response on decision support queries.
>
iowait time is idle time. Period. This point has been debated
endlessly for Solaris and other OS's as well.

Here's the man page:
%iowait
Show the percentage of time that the CPU or
CPUs were
idle during which the system had an outstanding
disk I/O
request.

If the system had some other cpu bound work to perform you wouldn't ever
see any iowait time. Anyone claiming the cpu was 100% busy on the
sequential scan using the one set of numbers I posted is
misunderstanding the actual metrics.

> Thanks for your data, it exemplifies many of the points brought up:
> - Lots of disks and expensive I/O hardware does not help improve performance
> on large table queries because I/O bandwidth does not scale beyond
> 110-120MB/s on the fastest CPUs
>
I don't think that is the conclusion from anecdotal numbers I posted.
This file subsystem doesn't perform as well as expected for any tool.
Bonnie, dd, star, etc., don't get a better data rate either. In fact,
the storage system wasn't built for performance; it was build to
reliably hold a big chunk of data. Even so, postgresql is reading at
130MB/s on it, using about 30% of a single cpu, almost all of which was
system time. I would get the same 130MB/s on a system with cpus that
were substantially slower; the limitation isn't the cpus, or
postgresql. It's the IO system that is poorly configured for this test,
not postgresqls ability to use it.

In fact, given the numbers I posted, it's clear this system could
handily generate more than 120 MB/s using a single cpu given a better IO
subsystem; it has cpu time to spare. A simple test can be done:
build the database in /dev/shm and time the scans. It's the same read()
system call being used and now one has made the IO system "infinitely
fast". The claim is being made that standard postgresql is unable to
generate more than 120MB/s of IO on any IO system due to an inefficient
use of the kernel API and excessive memory copies, etc. Having the
database be on a ram based file system is an example of "expensive IO
hardware" and all else would be the same. Hmmm, now that I think about
this, I could throw a medium sized table onto /dev/shm using
tablespaces on one of our 8GB linux boxes. So why is this experiment
not valid, or what is it about the above assertion that I am missing?

Anyway, if one cares about high speed sequential IO, then one should use
a much larger block size to start. Using 8KB IOs is inappropriate for
such a configuration. We happen to be using 32KB blocks on our largest
database and it's been the best move for us.

-- Alan

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Alan Stange 2005-11-20 04:43:28 Re: Hardware/OS recommendations for large databases (
Previous Message Mark Kirkwood 2005-11-19 21:28:57 Re: Hardware/OS recommendations for large databases (