Re: Hardware/OS recommendations for large databases (

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: stange(at)rentec(dot)com, Luke Lonergan <llonergan(at)greenplum(dot)com>, Dave Cramer <pg(at)fastcrypt(dot)com>, Joshua Marsh <icub3d(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Hardware/OS recommendations for large databases (
Date: 2005-11-23 00:13:20
Message-ID: 200511230013.jAN0DKV10698@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Greg Stark wrote:
>
> Alan Stange <stange(at)rentec(dot)com> writes:
>
> > The point your making doesn't match my experience with *any* storage or program
> > I've ever used, including postgresql. Your point suggests that the storage
> > system is idle and that postgresql is broken because it isn't able to use the
> > resources available...even when the cpu is very idle. How can that make sense?
>
> Well I think what he's saying is that Postgres is issuing a read, then waiting
> for the data to return. Then it does some processing, and goes back to issue
> another read. The CPU is idle half the time because Postgres isn't capable of
> doing any work while waiting for i/o, and the i/o system is idle half the time
> while the CPU intensive part happens.
>
> (Consider as a pathological example a program that reads 8k then sleeps for
> 10ms, and loops doing that 1,000 times. Now consider the same program
> optimized to read 8M asynchronously and sleep for 10s. By the time it's
> finished sleeping it has probably read in all 8M. Whereas the program that
> read 8k in little chunks interleaved with small sleeps would probably take
> twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
> idle.)
>
> It's a reasonable theory and it's not inconsistent with the results you sent.
> But it's not exactly proven either. Nor is it clear how to improve matters.
> Adding additional threads to handle the i/o adds an enormous amount of
> complexity and creates lots of opportunity for other contention that could
> easily eat all of the gains.

Perfect summary. We have a background writer now. Ideally we would
have a background reader, that reads-ahead blocks into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead time is much shorter, requiring some kind of quick
"create a thread" approach that could easily bog us down as outlined
above.

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.

So what are our options? mmap()? I have no idea. Seems larger page
size does help.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Ralph Mason 2005-11-23 03:39:05 Binary Refcursor possible?
Previous Message Anjan Dave 2005-11-22 23:17:27 Re: High context switches occurring