Re: Weird XFS WAL problem

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Craig James <craig_james(at)emolecules(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Weird XFS WAL problem
Date:	2010-06-03 13:01:01
Message-ID:	AANLkTimlamtRdDKYWLSf_FaKuNRPdNQLlkYE7hStfvP6@mail.gmail.com
Lists:	pgsql-performance

On Wed, Jun 2, 2010 at 7:30 PM, Craig James <craig_james(at)emolecules(dot)com> wrote:
> I'm testing/tuning a new midsize server and ran into an inexplicable
> problem. With a RAID10 drive, when I move the WAL to a separate RAID1
> drive, TPS drops from over 1200 to less than 90! I've checked everything
> and can't find a reason.
>
> Here are the details.
>
> 8 cores (2x4 Intel Nehalem 2 GHz)
> 12 GB memory
> 12 x 7200 SATA 500 GB disks
> 3WARE 9650SE-12ML RAID controller with bbu
> 2 disks: RAID1 500GB ext4 blocksize=4096
> 8 disks: RAID10 2TB, stripe size 64K, blocksize=4096 (ext4 or xfs - see
> below)
> 2 disks: hot swap
> Ubuntu 10.04 LTS (Lucid)
>
> With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench results
> (this one is for xfs):
>
> Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> argon        24064M 70491  99 288158  25 129918  16 65296  97 428210  23 558.9   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 23283  81 +++++ +++ 13775  56 20143  74 +++++ +++ 15152  54
> argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+++++,+++,13775,56,20143\
> ,74,+++++,+++,15152,54
>
> pgbench -i -s 100 -U test
> pgbench -c 10 -t 10000 -U test
> scaling factor: 100
> query mode: simple
> number of clients: 10
> number of transactions per client: 10000
> number of transactions actually processed: 100000/100000
> tps = 1046.104635 (including connections establishing)
> tps = 1046.337276 (excluding connections establishing)
>
> Now the mystery: I moved the pg_xlog directory to a RAID1 array (same 3WARE
> controller, two more SATA 7200 disks). Run the same tests and ...
>
> tps = 82.325446 (including connections establishing)
> tps = 82.326874 (excluding connections establishing)
>
> I thought I'd made a mistake, like maybe I moved the whole database to the
> RAID1 array, but I checked and double checked. I even watched the lights
> blink - the WAL was definitely on the RAID1 and the rest of Postgres on the
> RAID10.
>
> So I moved the WAL back to the RAID10 array, and performance jumped right
> back up to the >1200 TPS range.
>
> Next I checked the RAID1 itself:
>
> dd if=/dev/zero of=./bigfile bs=8192 count=2000000
>
> which yielded 98.8 MB/sec - not bad. bonnie++ on the RAID1 pair showed good
> performance too:
>
> Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> argon        24064M 68601  99 110057  18 46534   6 59883  90 123053   7 471.3   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
> argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,\
> +++,+++++,+++,+++++,+++
>
> So ... anyone have any idea at all how TPS drops to below 90 when I move the
> WAL to a separate RAID1 disk? Does this make any sense at all? It's
> repeatable. It happens for both ext4 and xfs. It's weird.
>
> You can even watch the disk lights and see it: the RAID10 disks are on
> almost constantly when the WAL is on the RAID10, but when you move the WAL
> over to the RAID1, its lights are dim and flicker a lot, like it's barely
> getting any data, and the RAID10 disk's lights barely go on at all.

*) Is your RAID1 configured with write-back cache on the controller?
*) Have you tried changing wal_sync_method to fdatasync?

merlin
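
Both questions above come down to how quickly the RAID1 volume completes small synchronous writes. That is what WAL commits depend on, and the sequential dd and bonnie++ runs quoted above do not measure it. The following is only a minimal sketch of how one might check, assuming GNU dd (for oflag=dsync) and a running server reachable with the "test" role used in the pgbench runs. The helper names show_wal_sync_method and sync_write_test are hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; adjust the role and paths.

# Print the WAL sync method the server is currently using.
# Assumes psql can connect as the "test" role from the pgbench runs.
show_wal_sync_method() {
    psql -U test -c "SHOW wal_sync_method;"
}

# Write COUNT blocks of 8 kB into DIR, opening the output with O_DSYNC
# so that each block must be flushed before the next one is written.
# dd reports the elapsed time on stderr; COUNT divided by that time
# approximates the synchronous writes per second the volume sustains.
sync_write_test() {
    dir="$1"
    count="${2:-1000}"
    dd if=/dev/zero of="$dir/syncfile" bs=8192 count="$count" oflag=dsync
    status=$?
    rm -f "$dir/syncfile"   # clean up the scratch file
    return "$status"
}
```

For example, run sync_write_test against a directory on the RAID1 volume and again against one on the RAID10 volume, then compare the times. A 7200 rpm disk turns 120 times per second, so a rate of roughly a hundred writes per second would suggest each write is waiting on the platters rather than being absorbed by a write-back cache. To try the second suggestion, set wal_sync_method = fdatasync in postgresql.conf and reload the server.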

In response to

Weird XFS WAL problem at 2010-06-02 23:30:28 from Craig James