
From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: stange(at)rentec(dot)com
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Dave Cramer" <pg(at)fastcrypt(dot)com>, "Joshua Marsh" <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-21 18:06:48
Message-ID: BFA74CB8.142B5%llonergan@greenplum.com
Lists: pgsql-performance

Alan,

On 11/21/05 6:57 AM, "Alan Stange" <stange(at)rentec(dot)com> wrote:

> $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
> 800000+0 records in
> 800000+0 records out
>
> real 0m13.780s
> user 0m0.134s
> sys 0m13.510s
>
> Oops. I just wrote 470MB/s to a file system that has a peak write speed
> of 200MB/s.

How much RAM on this machine?
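
A quick back-of-the-envelope check on that number, using only the figures
quoted above:

  800000 * 8kB = ~6.5GB written
  ~6.5GB / 13.78s = ~470MB/s

With an array that peaks around 200MB/s, a large share of that data has to
still be sitting in the OS page cache when dd exits, which is why the number
looks so fast.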

> Now, you might say that you wrote a 16GB file on an 8 GB machine so this
> isn't an issue. It does make your dd numbers look fast as some of the
> data will be unwritten.

This simple test, at 2x memory, correlates very closely with the Bonnie++
numbers for sequential scan. What's more, we see close to the same peak in
practice with multiple scanners. Furthermore, if you run two of them
simultaneously (on two filesystems), you can also see that the I/O is what
limits them.
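
For anyone who wants to repeat it with the cache effect taken out entirely, a
minimal variant (sizes assume an 8GB machine, so 2x memory is 16GB) is to make
the final sync part of the timed run:

  # write 2x RAM and include the flush to disk in the elapsed time
  time sh -c "dd if=/dev/zero of=/fidb1/bigfile bs=8k count=2000000 && sync"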

> I'd also suggest running dd on the same files as postgresql. I suspect
> you'd find that the layout of the postgresql files isn't that good as
> they are grown bit by bit, unlike the file created by simply dd'ing a
> large file.

Can happen if you're not careful with filesystems (see above).

There's nothing "wrong" with the dd test.
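
If anyone wants to try that suggestion directly, something along these lines
reads one of the existing relation files back sequentially (the path is only
an illustration - the real files live under
$PGDATA/base/<database oid>/<relfilenode>):

  # read a 1GB relation segment back at dd speed for comparison
  time dd if=/fidb1/pgdata/base/16384/16385 of=/dev/null bs=8k

If a file that postgresql grew 8kB at a time comes back much slower than a
freshly dd'd file of the same size, that points at on-disk layout rather than
at the dd test itself.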

> I think your point doesn't hold up. Every time you make it, I come away
> posting another result showing it to be incorrect.

Prove it - your Reiserfs number was about the same.

I also posted an XFS number that was substantially higher than 110-120MB/s.

> The point you're making doesn't match my experience with *any* storage or
> program I've ever used, including postgresql. Your point suggests that
> the storage system is idle and that postgresql is broken because it
> isn't able to use the resources available...even when the cpu is very
> idle. How can that make sense? The issue here is that the storage
> system is very active doing reads on the files...which might be somewhat
> poorly allocated on disk because postgresql grows the tables bit by bit.

Then you've made my point - if the problem is contiguity of files on disk,
then larger allocation blocks would help on the CPU side.
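
One way to check the contiguity question directly, if xfsprogs' xfs_bmap (or
e2fsprogs' filefrag) is available, is to look at the extent map of one of the
table's files - the path is again only an illustration:

  # dump the extent map for a 1GB relation segment; more lines = more fragments
  xfs_bmap -v /fidb1/pgdata/base/16384/16385 | wc -l

A freshly dd'd file should show up as a handful of large extents; a table
grown 8kB at a time under concurrent load can end up with orders of magnitude
more, and every extra extent is another seek on a sequential scan.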

The objective is clear: given a high-performance filesystem, how much of the
available bandwidth can Postgres achieve? I think what we're seeing is that
XFS gets us dramatically closer to that objective.

> I had the same readahead in Reiser and in XFS. The XFS performance was
> better because XFS does a better job of large file allocation on disk,
> thus resulting in many fewer seeks (generated by the file system itself)
> to read the files back in. As an example, some file systems like UFS
> purposely scatter large files across cylinder groups to avoid forcing
> large seeks on small files; one can tune this behavior so that large
> files are more tightly allocated.

Our other tests have used ext3, reiser and Solaris 10 UFS, so this might
make some sense.

> Of course, because this is engineering, I have another obligatory data
> point: This time it's a 4.2GB table using 137,138 32KB pages with
> nearly 41 million rows.
>
> A "select count(1)" on the table completes in 14.6 seconds, for an
> average read rate of 320 MB/s.

So, assuming that the net memory scan rate is about 2GB/s, and that there are
two copies (one from FS cache to buffer cache, one from buffer cache to the
agg node), you effectively have a 700MB/s filesystem with the equivalent of
DirectIO (no FS cache), because you are reading directly from the I/O cache.
You got about half of that, because the I/O processing in the executor is
limited to 320MB/s even on that fast CPU.

My point is this: if you were to decrease the filesystem speed to, say,
400MB/s and still use the equivalent of DirectIO, I think Postgres would not
deliver 320MB/s, but rather something like 220MB/s, due to the
producer/consumer architecture of the executor. If you get that part, then
we're on the same track; otherwise we disagree.
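
To put rough numbers on that (just a back-of-the-envelope model, not a
measurement): if the read from the filesystem and the executor's processing of
the pages are serialized rather than overlapped, the delivered rate is roughly

  rate = 1 / (1/F + 1/C)

where F is the filesystem (or I/O cache) rate and C is the rate at which the
executor can process pages already in memory. Working backwards from your
320MB/s at F of ~700MB/s gives C of ~590MB/s, and plugging F = 400MB/s back in
gives about 240MB/s - right in the neighborhood of the 220MB/s I'm guessing at
above.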

> One cpu was idle, the other averaged 32% system time and 68% user time
> for the 14 second period. This is on a 2.2GHz Opteron. A faster cpu
> would show increased performance as I really am cpu bound finally.

Yep, with the equivalent of DirectIO you are.

- Luke
