Skip site navigation (1) Skip section navigation (2)

Re: BBU Cache vs. spindles

From: david(at)lang(dot)hm
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, James Mansion <james(at)mansionfamily(dot)plus(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Bruce Momjian <bruce(at)momjian(dot)us>, jd(at)commandprompt(dot)com, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Steve Crawford <scrawford(at)pinpointresearch(dot)com>, pgsql-performance(at)postgresql(dot)org, Ben Chobot <bench(at)silentmedia(dot)com>
Subject: Re: BBU Cache vs. spindles
Date: 2010-10-29 19:09:02
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performancepgsql-www
On Fri, 29 Oct 2010, Robert Haas wrote:

> On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> James Mansion <james(at)mansionfamily(dot)plus(dot)com> writes:
>>> Tom Lane wrote:
>>>> The other and probably worse problem is that there's no application
>>>> control over how soon changes to mmap'd pages get to disk.  An msync
>>>> will flush them out, but the kernel is free to write dirty pages sooner.
>>>> So if they're depending for consistency on writes not happening until
>>>> msync, it's broken by design.  (This is one of the big reasons we don't
>>>> use mmap'd space for Postgres disk buffers.)
>>> Well, I agree that it sucks for the reason you give - but you use
>>> write and that's *exactly* the same in terms of when it gets written,
>>> as when you update a byte on an mmap'd page.
>> Uh, no, it is not.  The difference is that we can update a byte in a
>> shared buffer, and know that it *isn't* getting written out before we
>> say so.  If the buffer were mmap'd then we'd have no control over that,
>> which makes it mighty hard to obey the WAL "write log before data"
>> paradigm.
>> It's true that we don't know whether write() causes an immediate or
>> delayed disk write, but we generally don't care that much.  What we do
>> care about is being able to ensure that a WAL write happens before the
>> data write, and with mmap we don't have control over that.
> Well, we COULD keep the data in shared buffers, and then copy it into
> an mmap()'d region rather than calling write(), but I'm not sure
> there's any advantage to it.  Managing address space mappings is a
> pain in the butt.

keep in mind that you have no way of knowing what order the data in the 
mmap region gets written out to disk.

David Lang
>From pgsql-performance-owner(at)postgresql(dot)org  Fri Oct 29 17:34:16 2010
Received: from ( [])
	by (Postfix) with ESMTP id A0D1C1337B63
	for <pgsql-performance-postgresql(dot)org(at)mail(dot)postgresql(dot)org>; Fri, 29 Oct 2010 17:34:15 -0300 (ADT)
Received: from ([])
 by ( []) (amavisd-maia, port 10024)
 with ESMTP id 12192-01
 for <pgsql-performance-postgresql(dot)org(at)mail(dot)postgresql(dot)org>;
 Fri, 29 Oct 2010 20:34:08 +0000 (UTC)
X-Greylist: delayed 00:06:39.061466 by SQLgrey-1.7.6
Received: from ( [])
	by (Postfix) with SMTP id 0E308133729E
	for <pgsql-performance(at)postgresql(dot)org>; Fri, 29 Oct 2010 17:34:07 -0300 (ADT)
Received: (qmail 22970 invoked by uid 503); 29 Oct 2010 20:40:06 -0000
Received: from (HELO (
  by with SMTP; 29 Oct 2010 20:40:06 -0000
Received: from (HELO queueout) (
	by with SMTP; 29 Oct 2010 22:27:25 +0200
Received: from (HELO apollo13) (lists%peufeu(dot)com(at)88(dot)161(dot)102(dot)87)
  by with SMTP; 29 Oct 2010 22:27:23 +0200
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
To: pgsql-performance(at)postgresql(dot)org, "Steve Wong" <powerpchead(at)yahoo(dot)com>
Subject: Re: MVCC and Implications for (Near) Real-Time Application
References: <203952(dot)56702(dot)qm(at)web111712(dot)mail(dot)gq1(dot)yahoo(dot)com>
Date: Fri, 29 Oct 2010 22:27:23 +0200
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
From: "Pierre C" <lists(at)peufeu(dot)com>
Message-ID: <op(dot)vlctrxpjeorkce(at)apollo13>
In-Reply-To: <203952(dot)56702(dot)qm(at)web111712(dot)mail(dot)gq1(dot)yahoo(dot)com>
User-Agent: Opera Mail/10.62 (Linux)
X-Ovh-Tracer-Id: 3326752751527134769
X-Ovh-Remote: (
X-Ovh-Local: (
X-Spam-Check: DONE|U 0.5/N
X-Virus-Scanned: Maia Mailguard 1.0.1
X-Spam-Status: No, hits=-1.9 tagged_above=-10 required=5 tests=BAYES_00=-1.9,
X-Archive-Number: 201010/589
X-Sequence-Number: 41007

> My questions are: (1) Does the MVCC architecture introduce significant  
> delays between insert by a thread and visibility by other threads

As said by others, once commited it is immediately visible to all

> (2) Are there any available benchmarks that can measure this delay?

Since you will not be batching INSERTs, you will use 1 INSERT per  
If you use Autocommit mode, that's it.
If you don't, you will get a few extra network roundtrips after the  
INSERT, to send the COMMIT.

One INSERT is usually extremely fast unless you're short on RAM and the  
indexes that need updating need some disk seeking.

Anyway, doing lots of INSERTs each in its own transaction is usually very  
low-throughput, because at each COMMIT, postgres must always be sure that  
all the data is actually written to the harddisks. So, depending on the  
speed of your harddisks, each COMMIT can take up to 10-20 milliseconds.

On a 7200rpm harddisk, it is absolutely impossible to do more than 7200  
commits/minute if you want to be sure each time that the data really is  
written on the harddisk, unless :

- you use several threads (one disk write can group several commits from  
different connections, see the config file docs)
- you turn of synchronous_commit ; in this case commit is instantaneous,  
but if your server loses power or crashes, the last few seconds of data  
may be lost (database integrity is still guaranteed though)
- you use a battery backup cache on your RAID controller, in this case  
"written to the harddisks" is replaced by "written to batteyr backed RAM"  
which is a lot faster

If you dont use battery backed cache, place the xlog on a different RAID1  
array than the tables/indexes, this allows committing of xlog records  
(which is the time critical part) to proceed smoothly and not be disturbed  
by other IO on the indexes/tables. Also consider tuning your bgwriter and  
checkpoints, after experimentation under realistic load conditions.

So, when you benchmark your application, if you get disappointing results,  
think about this...

In response to

pgsql-performance by date

Next:From: Robert HaasDate: 2010-10-29 19:15:02
Subject: Re: [PERFORM] typoed column name, but postgres didn't grump
Previous:From: Tom LaneDate: 2010-10-29 19:07:26
Subject: Re: [PERFORM] typoed column name, but postgres didn't grump

pgsql-www by date

Next:From: Greg Sabino MullaneDate: 2010-10-29 19:15:03
Subject: Re: Fwd: Can't post to hackers mailing list
Previous:From: Joshua D. DrakeDate: 2010-10-29 19:01:56
Subject: Re: Fwd: Can`t post to hackers mailing list

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group