Re: shared_buffers documentation

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: shared_buffers documentation
Date: 2010-04-17 01:47:30
Message-ID: 4BC91332.6060702@2ndquadrant.com
Lists: pgsql-hackers

Robert Haas wrote:
> Well, why can't they just hang out as dirty buffers in the OS cache,
> which is also designed to solve this problem?
>

If the OS were guaranteed to be as suitable for this purpose as the
approach taken in the database, this might work. But much as the
clock-sweep approach should outperform a simpler OS caching
implementation on many common workloads, there are a few spots where
making dirty writes the OS's problem can fall down:

1) That presumes the OS's write coalescing will solve the problem for
you by merging repeat writes, which, depending on the implementation,
it might not.

2) On some filesystems, such as ext3, any write with an fsync behind it
will flush the whole write cache out and defeat this optimization.
Since the spread checkpoint design has some such writes going to the
data disk in the middle of the currently processing checkpoint, in those
situations that's likely to push the first write of a block to disk
before it can be combined with a second. Had you kept the block in the
database's buffer cache instead, it might have survived as much as a
full checkpoint cycle longer.

3) The "timeout", as it were, for shared buffers is driven by the
distance between checkpoints, typically as long as 5 minutes. The
longest a filesystem will hold onto a write is probably less. On Linux
the worst case is typically 30 seconds before the OS considers a write
important to get out to disk; if you've already filled a lot of RAM
with writes, it can be substantially less. (See the sketch after this
list for a way to compare the two timeouts.)
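To make point 3 concrete, here's a minimal sketch of how you might
compare the two timeouts on a given system. The GUCs are real; the
defaults noted in the comments are my assumptions about a typical
stock configuration of that era, and the Linux tunable is named in a
comment since it lives outside the database:

    SHOW checkpoint_timeout;            -- default 5min: upper bound on how long
                                        -- a dirty block sits in shared_buffers
    SHOW checkpoint_completion_target;  -- default 0.5: how far checkpoint writes
                                        -- are spread across that interval
    -- The comparable number on the Linux side:
    --   sysctl vm.dirty_expire_centisecs    (default 3000 = 30 seconds)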

> I guess the obvious question is whether Windows "doesn't need" more
> shared memory than that, or whether it "can't effectively use" more
> memory than that.
>

It's probably "can't effectively use". We know for a fact that
applications where blocks regularly accumulate high usage counts and
see repeat reads and writes to them, which includes pgbench, benefit in
several easy-to-measure ways from using larger amounts of database
buffer cache. There's just plain less churn of buffers going in and
out of there. The alternate explanation, "Windows is just so much
better at read/write caching that you should give it most of the RAM
anyway", doesn't sound as probable as the more commonly proposed
theory, "Windows doesn't handle large blocks of shared memory well".
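If you want to watch those usage counts accumulate yourself, here's a
minimal sketch, assuming the pg_buffercache contrib module is
installed; on a pgbench database the high-usagecount buckets fill up
quickly:

    -- Distribution of buffer usage counts across shared_buffers
    -- (unused buffers show up with a NULL usagecount)
    SELECT usagecount, count(*) AS buffers
      FROM pg_buffercache
     GROUP BY usagecount
     ORDER BY usagecount;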

Note that there's no discussion of the why behind this in the commit
you just did, just a description of what happens. The reasons why are
left undefined, which I feel is appropriate given that we really don't
know for sure. Still waiting for somebody to let loose the Visual
Studio profiler and measure what's causing the degradation at larger
sizes.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us
