Re: rough benchmarks, sata vs. ssd

From: CSS <css(at)morefoo(dot)com>
To: Ivan Voras <ivoras(at)freebsd(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: rough benchmarks, sata vs. ssd
Date: 2012-02-11 06:35:17
Message-ID: 0888C8EF-27A5-4A58-8515-68F8005CF189@morefoo.com
Lists: pgsql-performance


On Feb 3, 2012, at 6:23 AM, Ivan Voras wrote:

> On 31/01/2012 09:07, CSS wrote:
>> Hello all,
>>
>> Just wanted to share some results from some very basic benchmarking
>> runs comparing three disk configurations on the same hardware:
>>
>> http://morefoo.com/bench.html
>
> That's great!

Thanks. I did spend a fair amount of time on it. It was also a
good excuse to learn a little about gnuplot, which I used to draw
the (somewhat oddly combined) system stats. I really wanted to see
IO and CPU info over the duration of a test even if I couldn't
really know what part of the test was running. Don't ask me why
iostat sometimes shows greater than 100% in the "busy" column
though. It is in the raw iostat output I used to create the graphs.

>
>> *Tyan B7016 mainboard w/onboard LSI SAS controller
>> *2x4 core xeon E5506 (2.13GHz)
>> *64GB ECC RAM (8GBx8 ECC, 1033MHz)
>> *2x250GB Seagate SATA 7200.9 (ST3250824AS) drives (yes, old and slow)
>> *2x160GB Intel 320 SSD drives
>
> It shows that you can have large cheap SATA drives and small fast SSD-s, and up to a point have best of both worlds. Could you send me (privately) a tgz of the results (i.e. the pages+images from the above URL), I'd like to host them somewhere more permanently.

Sent offlist, including raw vmstat, iostat and zpool iostat output.

>
>> The ZIL is a bit of a cheat, as it allows you to throw all the
>> synchronous writes to the SSD
>
> This is one of the main reasons it was made. It's not a cheat, it's by design.

I meant that only in the best way. Some of my proudest achievements
are cheats. :)

It's a clever way of moving cache to something non-volatile and
providing a fallback, although the fallback would be insanely slow
in comparison.

>
>> Why ZFS? Well, we adopted it pretty early for other tasks and it
>> makes a number of tasks easy. It's been stable for us for the most
>> part and our latest wave of boxes all use cheap SATA disks, which
>> gives us two things - a ton of cheap space (in 1U) for snapshots and
>> all the other space-consuming toys ZFS gives us, and on this cheaper
>> disk type, a guarantee that we're not dealing with silent data
>> corruption (these are probably the normal fanboy talking points).
>> ZFS snapshots are also a big time-saver when benchmarking. For our
>> own application testing I load the data once, shut down postgres,
>> snapshot pgsql + the app homedir and start postgres. After each run
>> that changes on-disk data, I simply rollback the snapshot.
>
> Did you tune ZFS block size for the postgresql data directory (you'll need to re-create the file system to do this)? When I investigated it in the past, it really did help performance.

I actually did not. A year or so ago I was doing some basic tests
on cheap SATA drives with ZFS and at least with pgbench, I could see
no difference at all. I actually still have some of that info, so
I'll include it here. This was a 4-core Xeon E5506 (2.13GHz), four
1TB WD RE3 drives in a RAIDZ1 array, 8GB RAM.

I tested three things - time to load an 8.5GB dump of one of our
dbs, time to run through a querylog of real data (1.4M queries), and
then pgbench with a scaling factor of 100, 20 clients, 10K
transactions per client.

default 128K zfs recordsize:

-9 minutes to load data
-17 minutes to run query log
-pgbench output

transaction type: TPC-B (sort of)
scaling factor: 100
query mode: simple
number of clients: 20
number of transactions per client: 10000
number of transactions actually processed: 200000/200000
tps = 100.884540 (including connections establishing)
tps = 100.887593 (excluding connections establishing)

8K zfs recordsize (wipe data dir and reinit db):

-10 minutes to load data
-21 minutes to run query log
-pgbench output

transaction type: TPC-B (sort of)
scaling factor: 100
query mode: simple
number of clients: 20
number of transactions per client: 10000
number of transactions actually processed: 200000/200000
tps = 97.896038 (including connections establishing)
tps = 97.898279 (excluding connections establishing)

Just thought I'd include that since I have the data.
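
For anyone who wants to put a number on the difference between the two runs above, here is a minimal sketch in Python. It assumes the pgbench summary text has been saved as strings; the helper names parse_tps and percent_change are made up for illustration and are not part of pgbench. It only parses the two "tps =" lines shown above and reports the relative change, so it says nothing about run-to-run variance.

```python
import re

# Matches the two summary lines printed by this pgbench version, e.g.
# "tps = 100.884540 (including connections establishing)".
TPS_LINE = re.compile(
    r"tps = ([0-9.]+) \((including|excluding) connections establishing\)"
)


def parse_tps(pgbench_output):
    """Return {'including': tps, 'excluding': tps} parsed from pgbench output."""
    return {kind: float(value) for value, kind in TPS_LINE.findall(pgbench_output)}


def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Summary lines copied from the two runs quoted in this message.
RUN_128K = """
tps = 100.884540 (including connections establishing)
tps = 100.887593 (excluding connections establishing)
"""
RUN_8K = """
tps = 97.896038 (including connections establishing)
tps = 97.898279 (excluding connections establishing)
"""

tps_128k = parse_tps(RUN_128K)
tps_8k = parse_tps(RUN_8K)

# Negative values mean the 8K recordsize run was slower than the 128K default.
tps_delta = percent_change(tps_128k["excluding"], tps_8k["excluding"])
# Wall-clock times in minutes from the lists above; positive means 8K took longer.
load_delta = percent_change(9, 10)
querylog_delta = percent_change(17, 21)

print("pgbench tps change, 8K vs 128K: %.2f%%" % tps_delta)
print("data load time change:          %.2f%%" % load_delta)
print("query log time change:          %.2f%%" % querylog_delta)
```

On these figures the pgbench difference comes out at roughly three percent, which is small enough to be noise on a single run; the query log replay shows the larger gap.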

>
>> I don't have any real questions for the list, but I'd love to get
>> some feedback, especially on the ZIL results. The ZIL results
>> interest me because I have not settled on what sort of box we'll be
>> using as a replication slave for this one - I was going to either go
>> the somewhat risky route of another all-SSD box or looking at just
>> how cheap I can go with lots of 2.5" SAS drives in a 2U.
>
> You probably know the answer to that: if you need lots of storage, you'll probably be better off using large SATA drives with small SSDs for the ZIL. 160 GB is probably more than you need for ZIL.
>
> One thing I never tried is mirroring a SATA drive and an SSD (only makes sense if you don't trust SSDs to be reliable yet) - I don't know if ZFS would recognize the asymmetry and direct most of the read requests to the SSD.

Our databases are pretty tiny. We could squeeze them on a pair of 160GB mirrored SSDs.

To be honest, the ZIL results really threw me for a loop. I had supposed that it would work well with bursty usage but that eventually the SATA drives would still be a choke point during heavy sustained sync writes since the difference in random sync write performance between the ZIL drives (SSD) and the actual data drives (SATA) was so huge. The benchmarks ran for quite some time and I am not spotting a point in the system graphs where the SATA gets truly saturated to the point that performance suffers.

I now have to think about whether a safe replication slave/backup could be built in 1U with four 2.5" SAS drives and a small mirrored pair of SSDs for ZIL. We've been trying to avoid building monster boxes - not only are 2.5" SAS drives expensive, but so is whatever case you find to hold a dozen or so of them. Outside of some old Sun blog posts, I am finding little evidence of people running PostgreSQL on ZFS with SATA drives augmented with SSD ZIL. I'd love to hear more feedback on that.

>
>> If you have any test requests that can be quickly run on the above
>> hardware, let me know.
>
> Blogbench (benchmarks/blogbench) results are always nice to see in a comparison.

I don't know much about it, but here's what I get on the zfs mirrored SSD pair:

[root(at)bltest1 /usr/ports/benchmarks/blogbench]# blogbench -d /tmp/bbench

Frequency = 10 secs
Scratch dir = [/tmp/bbench]
Spawning 3 writers...
Spawning 1 rewriters...
Spawning 5 commenters...
Spawning 100 readers...
Benchmarking for 30 iterations.
The test will run during 5 minutes.
[…]

Final score for writes: 182
Final score for reads : 316840

Thanks,

Charles

