Skip site navigation (1) Skip section navigation (2)

Re: Hardware/OS recommendations for large databases (

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: stange(at)rentec(dot)com, Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-23 00:13:20
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performance
Greg Stark wrote:
> Alan Stange <stange(at)rentec(dot)com> writes:
> > The point your making doesn't match my experience with *any* storage or program
> > I've ever used, including postgresql.   Your point suggests that the storage
> > system is idle  and that postgresql is broken because it isn't able to use the
> > resources available...even when the cpu is very idle.  How can that make sense?
> Well I think what he's saying is that Postgres is issuing a read, then waiting
> for the data to return. Then it does some processing, and goes back to issue
> another read. The CPU is idle half the time because Postgres isn't capable of
> doing any work while waiting for i/o, and the i/o system is idle half the time
> while the CPU intensive part happens.
> (Consider as a pathological example a program that reads 8k then sleeps for
> 10ms, and loops doing that 1,000 times. Now consider the same program
> optimized to read 8M asynchronously and sleep for 10s. By the time it's
> finished sleeping it has probably read in all 8M. Whereas the program that
> read 8k in little chunks interleaved with small sleeps would probably take
> twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
> idle.)
> It's a reasonable theory and it's not inconsistent with the results you sent.
> But it's not exactly proven either. Nor is it clear how to improve matters.
> Adding additional threads to handle the i/o adds an enormous amount of
> complexity and creates lots of opportunity for other contention that could
> easily eat all of the gains.

Perfect summary.  We have a background writer now.  Ideally we would
have a background reader, that reads-ahead blocks into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead time is much shorter, requiring some kind of quick
"create a thread" approach that could easily bog us down as outlined

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.

So what are our options?  mmap()?  I have no idea.  Seems larger page
size does help.

  Bruce Momjian                        |
  pgman(at)candle(dot)pha(dot)pa(dot)us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

In response to


pgsql-performance by date

Next:From: Ralph MasonDate: 2005-11-23 03:39:05
Subject: Binary Refcursor possible?
Previous:From: Anjan DaveDate: 2005-11-22 23:17:27
Subject: Re: High context switches occurring

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group