Re: Shared buffers, db transactions committed, and write IO on Solaris

From: Erik Jones <erik(at)myemma(dot)com>
To: Dimitri <dimitrik(dot)fr(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "PostgreSQL Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Shared buffers, db transactions committed, and write IO on Solaris
Date: 2007-03-30 05:22:52
Message-ID: 84D24979-AF34-4875-9AFB-DD4A7814B99A@myemma.com
Lists: pgsql-performance

On Mar 29, 2007, at 5:15 PM, Dimitri wrote:

>> >>
>> > Erik,
>> >
>> > using 'forcedirectio' simply brings your write operations down to
>> > the *real* volume - meaning when you need to write 10 bytes, you'll
>> > write 10 bytes (instead of the UFS block size (8K)). So it explains
>> > to me why your write volume became slower.
>
> I meant 'lower' (not 'slower')
>
>>
>> Sorry, that's not true. Google "ufs forcedirectio", go to the first
>> link, and you will find:
>>
>> "forcedirectio
>>
>> The forcedirectio (read "force direct IO") UFS option causes data not
>> to be buffered in kernel address space whenever data is transferred
>> between user address space and the disk. In other words, it bypasses
>> the file
>> system cache. For certain types of applications -- primarily database
>> systems -- this option can dramatically improve performance. In fact,
>> some database experts have argued that a file system using the
>> forcedirectio option will outperform a raw partition, though this
>> opinion seems fairly controversial.
>>
>> The forcedirectio option improves file system performance by
>> eliminating double buffering, providing a small, efficient code path
>> for file system reads and writes, and removing pressure on memory."
>
> Erik, please don't take me wrong, but reading Google (or better, the
> man pages) doesn't replace thinking and basic practice... The direct
> IO option is not a silver bullet that will solve all your problems
> (try doing a 'cp' on a filesystem mounted with 'forcedirectio', or
> keep your mailbox on it - you'll quickly understand the impact)...
>
>>
>> However, what this does mean is that writes will be at the actual
>> filesystem block size and not the cache block size (8K v. 512K).
>
> While a UFS filesystem is mounted normally, it uses its own cache
> for all operations (read and write) and saves data modifications on
> a per-page basis, meaning: when a process writes 200 bytes, there
> will be 200 bytes modified in cache, and then the whole page (8K) is
> written once the data is demanded to be flushed (and the WAL is
> written on each commit)...
>
> Now, mounted with the 'forcedirectio' option, UFS is free of the
> page-size constraint and, like a raw device, will write exactly the
> demanded amount of data, meaning: when a process writes 200 bytes,
> it'll write exactly 200 bytes to the disk.

You are right that the page-size constraint is lifted, since
directio cuts out the VM filesystem cache. However, the Solaris
kernel still issues IO ops in terms of its logical block size (which
we have at the default 8K). It can issue IO ops for fragments as
small as 1/8th of the block size, but Postgres issues its IO requests
in terms of its block size, which means that IO ops from Postgres
will be in 8K chunks, which is exactly what we see when we look at
our system IO stats. In fact, if any IO request is made that isn't a
multiple of 512 bytes (the disk sector size), the file system
switches back to buffered IO.
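
For anyone who wants to verify this on their own box, here's a rough
sketch of the sort of DTrace one-liner that would show it (run as
root on Solaris 10; the io provider and b_bcount are standard, but
treat this as a sketch rather than something we've actually run here):

  # dtrace -n 'io:::start /* fires on each physical IO request */
      { @sizes[args[0]->b_bcount] = count(); }'

Let it run for a while, hit Ctrl-C, and it prints a count of physical
IO requests keyed by size in bytes; with Postgres on the
directio-mounted filesystem you'd expect the 8192 bucket to dominate.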

>
> However, to understand the TX-number mystery, I think the only
> possible solution is to reproduce it in a small live test:
>
> (I'm sure you're aware you can mount/unmount forcedirectio
> dynamically?)
>
> During a stable workload, do:
>
> # mount -o remount,logging /path_to_your_filesystem
>
> and check whether the I/O volume increases along with the TX numbers;
> then come back:
>
> # mount -o remount,forcedirectio /path_to_your_filesystem
>
> and see whether the I/O volume decreases along with the TX numbers...

That's an excellent idea and I'll run it by the rest of our team
tomorrow.
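
For the record, this is roughly the kind of loop I have in mind for
watching the two numbers side by side during that test ("mydb" is
just a placeholder for our actual database name, so take this as a
sketch, not a finished script):

  #!/bin/sh
  # Sample committed-transaction totals alongside Solaris disk stats
  # while the mount options are toggled.  "mydb" is a placeholder.
  while true; do
      psql -At -d mydb -c \
        "SELECT now(), xact_commit FROM pg_stat_database WHERE datname = 'mydb'"
      sleep 5
  done &
  iostat -xnz 5    # extended per-device stats, 5-second intervals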

erik jones <erik(at)myemma(dot)com>
software developer
615-296-0838
emma(r)
