Re: BBU Cache vs. spindles

From: david(at)lang(dot)hm
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, James Mansion <james(at)mansionfamily(dot)plus(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Bruce Momjian <bruce(at)momjian(dot)us>, jd(at)commandprompt(dot)com, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Steve Crawford <scrawford(at)pinpointresearch(dot)com>, pgsql-performance(at)postgresql(dot)org, Ben Chobot <bench(at)silentmedia(dot)com>
Subject: Re: BBU Cache vs. spindles
Date: 2010-10-29 19:09:02
Message-ID: alpine.DEB.2.00.1010291206351.13891@asgard.lang.hm
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance pgsql-www

On Fri, 29 Oct 2010, Robert Haas wrote:

> On Thu, Oct 28, 2010 at 5:26 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> James Mansion <james(at)mansionfamily(dot)plus(dot)com> writes:
>>> Tom Lane wrote:
>>>> The other and probably worse problem is that there's no application
>>>> control over how soon changes to mmap'd pages get to disk.  An msync
>>>> will flush them out, but the kernel is free to write dirty pages sooner.
>>>> So if they're depending for consistency on writes not happening until
>>>> msync, it's broken by design.  (This is one of the big reasons we don't
>>>> use mmap'd space for Postgres disk buffers.)
>>
>>> Well, I agree that it sucks for the reason you give - but you use
>>> write and that's *exactly* the same in terms of when it gets written,
>>> as when you update a byte on an mmap'd page.
>>
>> Uh, no, it is not.  The difference is that we can update a byte in a
>> shared buffer, and know that it *isn't* getting written out before we
>> say so.  If the buffer were mmap'd then we'd have no control over that,
>> which makes it mighty hard to obey the WAL "write log before data"
>> paradigm.
>>
>> It's true that we don't know whether write() causes an immediate or
>> delayed disk write, but we generally don't care that much.  What we do
>> care about is being able to ensure that a WAL write happens before the
>> data write, and with mmap we don't have control over that.
>
> Well, we COULD keep the data in shared buffers, and then copy it into
> an mmap()'d region rather than calling write(), but I'm not sure
> there's any advantage to it. Managing address space mappings is a
> pain in the butt.

keep in mind that you have no way of knowing what order the data in the
mmap region gets written out to disk.

David Lang
>From pgsql-performance-owner(at)postgresql(dot)org Fri Oct 29 17:34:16 2010
Received: from maia.hub.org (maia-3.hub.org [200.46.204.243])
by mail.postgresql.org (Postfix) with ESMTP id A0D1C1337B63
for <pgsql-performance-postgresql(dot)org(at)mail(dot)postgresql(dot)org>; Fri, 29 Oct 2010 17:34:15 -0300 (ADT)
Received: from mail.postgresql.org ([200.46.204.86])
by maia.hub.org (mx1.hub.org [200.46.204.243]) (amavisd-maia, port 10024)
with ESMTP id 12192-01
for <pgsql-performance-postgresql(dot)org(at)mail(dot)postgresql(dot)org>;
Fri, 29 Oct 2010 20:34:08 +0000 (UTC)
X-Greylist: delayed 00:06:39.061466 by SQLgrey-1.7.6
Received: from 29.mail-out.ovh.net (29.mail-out.ovh.net [87.98.216.213])
by mail.postgresql.org (Postfix) with SMTP id 0E308133729E
for <pgsql-performance(at)postgresql(dot)org>; Fri, 29 Oct 2010 17:34:07 -0300 (ADT)
Received: (qmail 22970 invoked by uid 503); 29 Oct 2010 20:40:06 -0000
Received: from b9.ovh.net (HELO mail641.ha.ovh.net) (213.186.33.59)
by 29.mail-out.ovh.net with SMTP; 29 Oct 2010 20:40:06 -0000
Received: from b0.ovh.net (HELO queueout) (213.186.33.50)
by b0.ovh.net with SMTP; 29 Oct 2010 22:27:25 +0200
Received: from par69-8-88-161-102-87.fbx.proxad.net (HELO apollo13) (lists%peufeu(dot)com(at)88(dot)161(dot)102(dot)87)
by ns0.ovh.net with SMTP; 29 Oct 2010 22:27:23 +0200
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
To: pgsql-performance(at)postgresql(dot)org, "Steve Wong" <powerpchead(at)yahoo(dot)com>
Subject: Re: MVCC and Implications for (Near) Real-Time Application
References: <203952(dot)56702(dot)qm(at)web111712(dot)mail(dot)gq1(dot)yahoo(dot)com>
Date: Fri, 29 Oct 2010 22:27:23 +0200
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
From: "Pierre C" <lists(at)peufeu(dot)com>
Message-ID: <op(dot)vlctrxpjeorkce(at)apollo13>
In-Reply-To: <203952(dot)56702(dot)qm(at)web111712(dot)mail(dot)gq1(dot)yahoo(dot)com>
User-Agent: Opera Mail/10.62 (Linux)
X-Ovh-Tracer-Id: 3326752751527134769
X-Ovh-Remote: 88.161.102.87 (par69-8-88-161-102-87.fbx.proxad.net)
X-Ovh-Local: 213.186.33.20 (ns0.ovh.net)
X-Spam-Check: DONE|U 0.5/N
X-Virus-Scanned: Maia Mailguard 1.0.1
X-Spam-Status: No, hits=-1.9 tagged_above=-10 required=5 tests=BAYES_00=-1.9,
RCVD_IN_DNSWL_NONE=-0.0001
X-Spam-Level:
X-Archive-Number: 201010/589
X-Sequence-Number: 41007

> My questions are: (1) Does the MVCC architecture introduce significant
> delays between insert by a thread and visibility by other threads

As said by others, once commited it is immediately visible to all

> (2) Are there any available benchmarks that can measure this delay?

Since you will not be batching INSERTs, you will use 1 INSERT per
transaction.
If you use Autocommit mode, that's it.
If you don't, you will get a few extra network roundtrips after the
INSERT, to send the COMMIT.

One INSERT is usually extremely fast unless you're short on RAM and the
indexes that need updating need some disk seeking.

Anyway, doing lots of INSERTs each in its own transaction is usually very
low-throughput, because at each COMMIT, postgres must always be sure that
all the data is actually written to the harddisks. So, depending on the
speed of your harddisks, each COMMIT can take up to 10-20 milliseconds.

On a 7200rpm harddisk, it is absolutely impossible to do more than 7200
commits/minute if you want to be sure each time that the data really is
written on the harddisk, unless :

- you use several threads (one disk write can group several commits from
different connections, see the config file docs)
- you turn of synchronous_commit ; in this case commit is instantaneous,
but if your server loses power or crashes, the last few seconds of data
may be lost (database integrity is still guaranteed though)
- you use a battery backup cache on your RAID controller, in this case
"written to the harddisks" is replaced by "written to batteyr backed RAM"
which is a lot faster

If you dont use battery backed cache, place the xlog on a different RAID1
array than the tables/indexes, this allows committing of xlog records
(which is the time critical part) to proceed smoothly and not be disturbed
by other IO on the indexes/tables. Also consider tuning your bgwriter and
checkpoints, after experimentation under realistic load conditions.

So, when you benchmark your application, if you get disappointing results,
think about this...

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Haas 2010-10-29 19:15:02 Re: [PERFORM] typoed column name, but postgres didn't grump
Previous Message Tom Lane 2010-10-29 19:07:26 Re: [PERFORM] typoed column name, but postgres didn't grump

Browse pgsql-www by date

  From Date Subject
Next Message Greg Sabino Mullane 2010-10-29 19:15:03 Re: Fwd: Can't post to hackers mailing list
Previous Message Joshua D. Drake 2010-10-29 19:01:56 Re: Fwd: Can`t post to hackers mailing list