Re: Bug: Buffer cache is not scan resistant

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Sherry Moore" <sherry(dot)moore(at)Sun(dot)COM>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Pavan Deolasee" <pavan(at)enterprisedb(dot)com>, "Gavin Sherry" <swm(at)alcove(dot)com(dot)au>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>
Subject: Re: Bug: Buffer cache is not scan resistant
Date: 2007-03-06 22:27:37
Message-ID: 1173220058.3760.2140.camel@silverbirch.site
Lists: pgsql-hackers

On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote:

> - Based on a lot of the benchmarks and workloads I traced, the
> target buffer of read operations are typically accessed again
> shortly after the read, while writes are usually not. Therefore,
> the default operation mode is to bypass L2 for writes, but not
> for reads.

Hi Sherry,

I'm trying to relate what you've said to how we should proceed from
here. My understanding of what you've said is:

- Tom's assessment that the observed performance quirk could be fixed in
the OS kernel is correct and you have the numbers to prove it

- Solaris currently only does NTA for 128K reads, which we don't
issue. If we were to request 16 blocks at a time, we would get this
benefit, on Solaris at least. The copyout_max_cached parameter can be
patched, but isn't a normal system tunable.

- other workloads you've traced *do* reuse the same buffer again very
soon afterwards when reading sequentially (though not when writing).
Reducing the working set size is an effective technique for improving
performance if we don't have a kernel that does NTA or we don't read in
big enough chunks (we need both to get NTA to kick in).
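
To make the arithmetic in the second point concrete: with the default 8K
block size, 16 blocks is exactly 128K per request. The following is only a
rough sketch of what "requesting 16 blocks at a time" means at the read()
level, not how the backend actually issues its I/O. The names count_reads
and demo and the 1 MB scratch file are hypothetical, and whether any given
kernel applies NTA to the larger requests is not something this code can
show.

```python
import os
import tempfile

BLCKSZ = 8192                      # default PostgreSQL block size
BLOCKS_PER_READ = 16               # assumption: batch 16 blocks per request
CHUNK = BLCKSZ * BLOCKS_PER_READ   # 16 * 8K = 128K


def count_reads(path, request_size):
    """Read `path` sequentially; return (read calls that returned data, total bytes)."""
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    total = 0
    try:
        while True:
            buf = os.read(fd, request_size)
            if not buf:            # empty result means end of file
                break
            calls += 1
            total += len(buf)
    finally:
        os.close(fd)
    return calls, total


def demo():
    """Compare block-at-a-time reads with 128K reads on a hypothetical 1 MB file."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * (1024 * 1024))
        path = f.name
    try:
        return count_reads(path, BLCKSZ), count_reads(path, CHUNK)
    finally:
        os.unlink(path)
```

Both passes read the same bytes; the second simply does so in far fewer,
larger requests, which is the size Solaris needs to see before NTA applies.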

and what you haven't said:

- all of this is orthogonal to the issue of buffer cache spoiling in
PostgreSQL itself. That issue does still exist as a non-OS issue, but
we've been discussing in detail the specific case of L2 cache effects
with specific kernel calls. All of the test results have been
stand-alone, so we've not done any measurements in that area. I say this
because you make the point that reducing the working set size of write
workloads has no effect on the L2 cache issue, but ISTM it's still
potentially a cache spoiling issue.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
