Re: Hardware/OS recommendations for large databases (

From: Alan Stange <stange(at)rentec(dot)com>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-22 14:26:38
Message-ID: 43832A9E.3030801@rentec.com
Lists: pgsql-performance

Luke,

- XFS will probably generate better data rates with larger files. You
really need to use the same file size as postgresql does. Why compare
the speed of reading a 16G file with the speed of reading a 1G file?
They won't be the same. If need be, write some code that does the test
or modify lmdd to read a sequence of 1G files. Will this make a
difference? You don't know until you do it. Any time you cross a
couple of powers of two in computing, you should expect some differences.

- you did umount the file system before reading the 16G file back in?
Because if you didn't then your read numbers are possibly garbage.
When the read began, 8G of the file was in memory. You'd be very naive
to think that the read of the first 8GB somehow flushed that
cached data out of memory. After all, why would the kernel flush pages
from file X when you're in the middle of a sequential read of...file
X? I'm not sure how Linux handles this, but Solaris would've found the
8G still in memory.

- What was the hardware and disk configuration on which these numbers
were generated? For example, if you have a U320 controller, how did
the read rate become larger than 320MB/s?

- how did the results change from before? Just posting the new results
is misleading given all the boasting we've had to read about your past
results.

- there are two results below for writing to ext2: one at 209 MB/s and
one at 113MB/s. Why are they different?

- what was the cpu usage during these tests? We see postgresql doing
200+MB/s of IO. You've claimed many times that the machine would be
compute bound at lower IO rates, so how much idle time does the cpu
still have?

- You wrote: "We'll do a 16GB table size to ensure that we aren't
reading from the read cache." Do you really believe that? You have
to umount the file system before each test to ensure you're really
measuring the disk IO rate. If I'm reading your results correctly, you
have three results each for ext2 and xfs, each of which is faster than
the prior one. If so, then one of them is clearly reading from the
read cache.

- Gee, it's so nice of you to drop your 120MB/s observation. I guess my
reading at 300MB/s wasn't convincing enough. Yeah, I think it was the
cpus too...

- I wouldn't focus on the flat 64% of the data rate number. It'll
probably be different on other systems.
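The "read a sequence of 1G files" test suggested in the first point could
be sketched roughly as below. This is only an illustration, not lmdd and not
anything from the original runs: the helper names (read_segments,
make_segments) and the segment file names are made up, and the demo uses tiny
stand-in files so it runs anywhere. A real measurement would use 1 GB
segments totalling well over RAM, on a file system that was umounted and
remounted first.

```python
import os
import tempfile
import time

BLOCK_SIZE = 8192  # same 8k block size as the lmdd runs in this thread


def read_segments(paths, block_size=BLOCK_SIZE):
    """Read each file in order, front to back; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    for path in paths:
        # buffering=0 gives a raw file object, so each read() is one system call
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(block_size)
                if not chunk:
                    break
                total += len(chunk)
    return total, time.perf_counter() - start


def make_segments(directory, count, segment_bytes):
    """Hypothetical helper: write `count` zero-filled stand-in segment files."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"segment.{i}")
        with open(path, "wb") as f:
            f.write(b"\0" * segment_bytes)
        paths.append(path)
    return paths


# Demo with four 1 MiB files (assumed sizes, chosen only to keep this quick).
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = make_segments(tmp, count=4, segment_bytes=1 << 20)
    demo_bytes, demo_secs = read_segments(demo_paths)
    print(f"{demo_bytes / 1e6:.1f} MB in {demo_secs:.4f} s")
```

The demo files were just written, so they are read from cache and the printed
rate means nothing; the point is only the shape of the loop.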

I'm all for testing and testing. It seems you still cut a corner
by not umounting the file system first. Maybe I'm a little too old
school on this, but I wouldn't spend a dime until you've done the
measurements correctly.
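
To put a rough number on why a warm cache matters, here is a toy model. It
assumes the cached part of the file is served at one fixed rate and the rest
at the true disk rate. The 200 MB/s disk rate and 2000 MB/s cache rate are
invented for illustration and are not measurements from this thread.

```python
def apparent_rate(file_mb, cached_mb, disk_mb_s, cache_mb_s):
    """Apparent MB/s for a sequential read when cached_mb is already in memory."""
    disk_secs = (file_mb - cached_mb) / disk_mb_s   # part that really hits disk
    cache_secs = cached_mb / cache_mb_s             # part served from page cache
    return file_mb / (disk_secs + cache_secs)


# Hypothetical inputs: 16384 MB file, 8192 MB of it cached.
cold = apparent_rate(16384, 0, disk_mb_s=200, cache_mb_s=2000)
warm = apparent_rate(16384, 8192, disk_mb_s=200, cache_mb_s=2000)
print(f"cold: {cold:.0f} MB/s, half cached: {warm:.0f} MB/s")
```

Under those made-up rates, half the file in cache nearly doubles the apparent
read speed, which is why the umount matters.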

Good Luck.

-- Alan

Luke Lonergan wrote:
> Alan,
>
> Looks like Postgres gets sensible scan rate scaling as the filesystem speed
> increases, as shown below. I'll drop my 120MB/s observation - perhaps CPUs
> got faster since I last tested this.
>
> The scaling looks like 64% of the I/O subsystem speed is available to the
> executor - so as the I/O subsystem increases in scan rate, so does Postgres'
> executor scan speed.
>
> So that leaves the question - why not more than 64% of the I/O scan rate?
> And why is it a flat 64% as the I/O subsystem increases in speed from
> 333-400MB/s?
>
> - Luke
>
> ================= Results ===================
>
> Unless noted otherwise all results posted are for block device readahead set
> to 16M using "blockdev --setra=16384 <block_device>". All are using the
> 2.6.9-11 Centos 4.1 kernel.
>
> For those who don't have lmdd, here is a comparison of two results on an
> ext2 filesystem:
>
> ============================================================================
> [root@modena1 dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
> bs=8k count=800000 && sync)"
> 800000+0 records in
> 800000+0 records out
>
> real 0m33.057s
> user 0m0.116s
> sys 0m13.577s
>
> [root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=800000 sync=1
> 6553.6000 MB in 31.2957 secs, 209.4092 MB/sec
>
> real 0m33.032s
> user 0m0.087s
> sys 0m13.129s
> ============================================================================
>
> So lmdd with sync=1 is equivalent to a sync after a dd.
>
> I use 2x memory with dd for the *READ* performance testing, but let's make
> sure things are synced on both write and read for this set of comparisons.
>
> First, let's test ext2 versus "ext3, data=ordered", versus xfs:
>
> ============================================================================
> 16GB write, then read
> ============================================================================
> -----------------------
> ext2:
> -----------------------
> [root@modena1 dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 144.2670 secs, 113.5672 MB/sec
>
> [root@modena1 dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 49.3766 secs, 331.8170 MB/sec
>
> -----------------------
> ext3, data=ordered:
> -----------------------
> [root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 137.1607 secs, 119.4511 MB/sec
>
> [root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 48.7398 secs, 336.1527 MB/sec
>
> -----------------------
> xfs:
> -----------------------
> [root@modena1 ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 52.6141 secs, 311.3994 MB/sec
>
> [root@modena1 ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
> count=2000000 sync=1
> 16384.0000 MB in 40.2807 secs, 406.7453 MB/sec
> ============================================================================
>
> I'm liking xfs! Something about the way files are laid out, as Alan
> suggested, seems to dramatically improve write performance, and perhaps
> consequently the read also improves. There doesn't seem to be a difference
> between ext3 and ext2, as expected.
>
> Now on to the Postgres 8 tests. We'll do a 16GB table size to ensure that
> we aren't reading from the read cache. I'll write this file through
> Postgres COPY to be sure that the file layout is as Postgres creates it. The
> alternative would be to use COPY once, then tar/untar onto different
> filesystems, but that may not duplicate the real world results.
>
> These tests will use Bizgres 0_8_1, which is an augmented 8.0.3. None of
> the augmentations act to improve the executor I/O though, so for these
> purposes it should be the same as 8.0.3.
>
> ============================================================================
> 26GB of DBT-3 data from the lineitem table
> ============================================================================
> llonergan=# select relpages from pg_class where relname='lineitem';
> relpages
> ----------
> 3159138
> (1 row)
>
> 3159138*8192/1000000
> 25879 Million Bytes, or 25.9GB
>
> -----------------------
> xfs:
> -----------------------
> llonergan=# \timing
> Timing is on.
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 394908.501 ms
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 99425.223 ms
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 99187.205 ms
>
> -----------------------
> ext2:
> -----------------------
> llonergan=# select relpages from pg_class where relname='lineitem';
> relpages
> ----------
> 3159138
> (1 row)
>
> llonergan=# \timing
> Timing is on.
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 395286.475 ms
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 195756.381 ms
> llonergan=# select count(1) from lineitem;
> count
> -----------
> 119994608
> (1 row)
>
> Time: 122822.090 ms
> ============================================================================
> Analysis of Postgres 8.0.3 results
> ============================================================================
>                                    ext2     xfs
> Write Speed (MB/s)                 114      311
> Read Speed (MB/s)                  332      407
> Postgres Seq Scan Speed (MB/s)     212      263
> Scan % of lmdd Read Speed          63.9%    64.6%
>
> Well - looks like we get linear scaling with disk/file subsystem speedup.
>
> - Luke
>
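
For anyone wanting to recheck the scan-rate arithmetic in the quoted results,
a short sketch follows. It takes the relpages value, the 8192-byte page size
from the quoted calculation, the third count(1) timing on each file system,
and the lmdd 16GB read rates, all copied from the message above. It assumes MB
means 10^6 bytes, which matches the lmdd output (2000000 blocks of 8k
reported as 16384 MB).

```python
PAGE_BYTES = 8192        # page size used in the quoted 3159138*8192 calculation
RELPAGES = 3159138       # relpages reported for lineitem

# Third (fastest) count(1) timing per file system, in milliseconds
scan_ms = {"ext2": 122822.090, "xfs": 99187.205}
# lmdd 16GB read rates from the quoted runs, in MB/s
lmdd_read = {"ext2": 331.8170, "xfs": 406.7453}

table_mb = RELPAGES * PAGE_BYTES / 1e6   # table size in units of 10**6 bytes


def scan_rate(fs):
    """Sequential scan rate in MB/s for the named file system."""
    return table_mb / (scan_ms[fs] / 1000.0)


for fs in ("ext2", "xfs"):
    rate = scan_rate(fs)
    pct = 100 * rate / lmdd_read[fs]
    print(f"{fs}: {rate:.1f} MB/s, {pct:.1f}% of lmdd read rate")
```

Computed this way the rates come out close to, but slightly below, the 212
and 263 in the table; the message does not say exactly how those two figures
were derived.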
