Re: Read/Write block sizes

From: Alan Stange <stange(at)rentec(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, Steve Poe <spoe(at)sfnet(dot)cc>, Chris Browne <cbbrowne(at)acm(dot)org>
Subject: Re: Read/Write block sizes
Date: 2005-08-24 04:00:26
Message-ID: 430BF0DA.5010002@rentec.com
Lists: pgsql-performance

Josh Berkus wrote:

>Steve,
>
>>I would assume that dbt2 with STP helps minimize the amount of hours
>>someone has to invest to determine performance gains with configurable
>>options?
>
>Actually, these I/O operation issues show up mainly with DW workloads, so the
>STP isn't much use there. If I can ever get some of these machines back
>from the build people, I'd like to start testing some stuff.
>
>One issue with testing this is that currently PostgreSQL doesn't support block
>sizes above 128K. We've already done testing on that (well, Mark has) and
>the performance gains aren't even worth the hassle of remembering you're on a
>different block size (like, +4%).
>
What size database was this on?

>What the Sun people have done with other DB systems is show that substantial
>performance gains are possible on large databases (>100G) using block sizes
>of 1MB. I believe that's possible (and that it probably makes more of a
>difference on Solaris than on BSD) but we can't test it without some hackery
>first.
>
We're running a 100+GB database, with long streams of 8KB reads and
the occasional _llseek(). I've been thinking about running with a
larger block size, expecting we'd see fewer system calls and a bit
more throughput.
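
To make the pattern concrete, here's a rough sketch (not PostgreSQL
source; BLOCKSZ and the file argument are just illustrative) of a
sequential scan issuing one read() per block. At 8KB blocks a 100GB
scan is roughly 13 million traps into the kernel; at 1MB blocks it's
closer to 100,000:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCKSZ (8 * 1024)   /* try 1MB here: ~128x fewer syscalls */

int main(int argc, char **argv)
{
    char *buf = malloc(BLOCKSZ);
    long long nbytes = 0, ncalls = 0;
    ssize_t n;
    int fd;

    if (argc < 2 || buf == NULL || (fd = open(argv[1], O_RDONLY)) < 0) {
        perror("setup");
        return 1;
    }
    while ((n = read(fd, buf, BLOCKSZ)) > 0) {  /* one kernel trap per block */
        nbytes += n;
        ncalls++;
    }
    printf("%lld bytes in %lld read() calls\n", nbytes, ncalls);
    close(fd);
    free(buf);
    return 0;
}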

read() calls are a very expensive way to get 8KB of memory (that we know
is already resident) during scans. One has to trap into the kernel, do
the usual process state accounting, find the block, copy the memory to
userspace, return from the kernel to user space reversing all the
process accounting, pick out the bytes one needs, and repeat all over
again. That's quite a few sacrificial cache lines for 8KB. Yeah,
sure, Linux syscalls are fast, but they aren't that fast, and other
operating systems (Windows and Solaris) have a bit more overhead on
syscalls.
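
For comparison, a rough sketch of the mmap() alternative for data we
know is resident: map the file once, and every subsequent access is an
ordinary load, with no per-block trap and no copy to userspace. (Purely
illustrative; the stride and checksum are only there to touch the
pages.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct stat st;
    int fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0) {
        perror("open/fstat");
        return 1;
    }

    /* One trap to map the whole file; after this, no syscalls to read it. */
    char *p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touch one byte per 8KB block: plain loads (page faults aside),
     * versus one read() trap + copy per block. */
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 8192)
        sum += (unsigned char) p[i];

    printf("checksum %lu\n", sum);
    munmap(p, (size_t) st.st_size);
    close(fd);
    return 0;
}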

Regarding large block sizes on Solaris: the Solaris folks can also use
large memory pages and avoid a lot of the TLB overhead from the VM
system. The various trapstat and cpustat commands can be quite
interesting to look at when running any large application on a Solaris
system.

It should be noted that having a large shared memory segment can be a
performance loser just from the standpoint of TLB thrashing. O(GB)
memory access patterns can take a huge performance hit in user space
with 4KB pages, compared to the kernel, which maps the "segmap"
(in Solaris parlance) with 4MB pages.
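
For what it's worth, a user process can ask for the same treatment for
its own segment. This sketch is Linux-flavored (SHM_HUGETLB; the admin
has to reserve hugepages beforehand); on Solaris the rough equivalent
is attaching with SHM_SHARE_MMU for ISM, which gets large pages where
it can. The segment size is just an example:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SEGSIZE (1UL << 30)  /* 1GB: ~256K TLB entries at 4KB, ~256 at 4MB */

int main(void)
{
    /* Ask for a large-page-backed SysV segment so a multi-GB buffer
     * pool doesn't thrash the TLB with 4KB mappings. */
    int shmid = shmget(IPC_PRIVATE, SEGSIZE,
                       IPC_CREAT | SHM_HUGETLB | 0600);
    if (shmid < 0) {
        perror("shmget(SHM_HUGETLB)");
        return 1;
    }

    char *seg = shmat(shmid, NULL, 0);
    if (seg == (char *) -1) {
        perror("shmat");
        return 1;
    }

    memset(seg, 0, SEGSIZE);  /* touch it all: far fewer TLB misses with 4MB pages */

    shmdt(seg);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}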

Anyway, I guess my point is that the right balance between kernel-managed
and PostgreSQL-managed buffers isn't obvious at all.

-- Alan
