Re: 2nd Level Buffer Cache

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura(at)softperience(dot)eu, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 15:54:08
Message-ID: AANLkTikYce6vhevkzoNGF0KwRyLGL=h7XNcJma0ktbix@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 21, 2011 at 5:24 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...
>
> A lot of people have talked about it. You can find references to mmap
> going back at least to 2001 or so. The problem is that it would depend
> on the OS implementing things in a certain way and guaranteeing things
> we don't think can be portably assumed. We would need to mlock large
> amounts of address space, which most OSes don't allow, and we would
> need to at least mlock and munlock lots of small bits of memory all
> over the place, which would create lots and lots of mappings that the
> kernel and hardware implementations would generally not appreciate.
>
>> As far as I know, no OS has a more sophisticated approach to eviction
>> than LRU.  And clock-sweep is a significant performance improvement
>> over LRU for frequently accessed database objects ... plus our
>> optimizations around not overwriting the whole cache for things like VACUUM.
>
> The clock-sweep algorithm was standard OS design before you or I knew
> how to type. I would expect any half-decent OS to have something at
> least as good -- perhaps better, because it can rely on hardware
> features to handle things.
>
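For concreteness, here is a minimal clock-sweep sketch in C (the names and
structure are illustrative, not PostgreSQL's actual buffer manager): each
access bumps a usage count, the sweeping hand decays it, and a buffer is
evicted only once its count reaches zero.

/* Minimal clock-sweep sketch; names and sizes are illustrative only. */
#define NBUFFERS  1024
#define MAX_USAGE 5

typedef struct
{
    int usage_count;    /* bumped on access, decayed by the sweep */
    int pinned;         /* in use right now; cannot be evicted */
} Buffer;

static Buffer pool[NBUFFERS];
static int    hand;     /* current clock-hand position */

/* Return the index of a buffer that may be evicted.
 * (A real implementation must also handle the all-pinned case.) */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        Buffer *buf = &pool[hand];

        hand = (hand + 1) % NBUFFERS;

        if (buf->pinned)
            continue;               /* skip buffers currently in use */

        if (buf->usage_count > 0)
            buf->usage_count--;     /* give it another trip around the clock */
        else
            return (int) (buf - pool);  /* usage hit zero: evict this one */
    }
}

/* Caller notes an access: cheap, no list reordering as strict LRU needs. */
static void
buffer_accessed(int i)
{
    if (pool[i].usage_count < MAX_USAGE)
        pool[i].usage_count++;
}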
> However, the second point is the crux of this issue and of all similar
> issues about where to draw the line between the OS and Postgres. The
> OS knows more about the hardware characteristics and can better
> optimize overall system behaviour, but Postgres understands its own
> access patterns better and can better optimize its own behaviour,
> whereas the OS is stuck reverse-engineering what Postgres needs,
> usually from simple heuristics.
>
>>
>> 2-level caches work well for a variety of applications.
>
> I think a 2-level cache with a simple heuristic like "pin all the
> indexes" is unlikely to be helpful. At least it won't optimize the
> average case, and I think that's been proven. It might be helpful for
> optimizing the worst case, which would reduce the standard deviation.
> Perhaps we're at the point now where that matters.
>
> Where it might be helpful is as a more refined version of the
> "sequential scans use a limited set of buffers" patch. Instead of
> having each sequential scan use a hard-coded number of buffers,
> perhaps all sequential scans should share a fraction of the global
> buffer pool, managed separately from the main pool. Though in my
> thought experiments I don't see any real win here. In the current
> scheme, if there's any sign a buffer is useful, it gets thrown out of
> the sequential scan's set of reusable buffers anyway.
>
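To make the "limited set of buffers" idea concrete, a scan-local ring might
look like the sketch below (purely illustrative; PostgreSQL's real buffer
access strategies differ in the details): the scan recycles a small, fixed
set of buffer slots instead of churning the whole pool.

/* Hypothetical ring of buffer slots reused by one sequential scan. */
#define RING_SIZE 32            /* e.g. 256kB worth of 8kB buffers */

typedef struct
{
    int buffers[RING_SIZE];     /* buffer ids owned by this scan */
    int current;                /* next slot to reuse */
} ScanRing;

/* Pick the next buffer for the scan, recycling ring slots in order.
 * If a slot turns out to be hot (still in use elsewhere), the real code
 * drops it from the ring and grabs a fresh buffer instead, which is the
 * behaviour referred to above. */
static int
ring_next_buffer(ScanRing *ring)
{
    int buf = ring->buffers[ring->current];

    ring->current = (ring->current + 1) % RING_SIZE;
    return buf;
}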
>> Now, what would be *really* useful is some way to avoid all the data
>> copying we do between shared_buffers and the FS cache.
>>
>
> Well, the two options are mmap/mlock or direct I/O. The former might
> be a fun experiment, but I expect any OS to fall over pretty quickly
> when faced with thousands (or millions) of 8kB mappings. The latter
> would need Postgres to do async I/O and, ideally, to have a global
> view of its I/O access patterns so it could do prefetching in a lot
> more cases.
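One modest, portable step toward that kind of prefetching (well short of
full async I/O) is a posix_fadvise(POSIX_FADV_WILLNEED) hint. The sketch
below is only an illustration; the file descriptor, BLCKSZ and block list
are assumptions, not anything from this thread.

/* Hint the kernel to start reading blocks we expect to need soon.
 * Illustrative only: fd, BLCKSZ and the block numbers are assumptions. */
#include <fcntl.h>

#define BLCKSZ 8192

static void
prefetch_blocks(int fd, const long *blocknums, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
    {
        /* Non-blocking hint; the actual read happens later. */
        (void) posix_fadvise(fd,
                             (off_t) blocknums[i] * BLCKSZ,
                             BLCKSZ,
                             POSIX_FADV_WILLNEED);
    }
}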

Can't you make just one large mapping and lock it in 8k regions? I
thought the problems with mmap were not being able to detect other
processes (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.html),
compatibility issues (possibly obsolete), etc.
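For what that pattern would look like, here is a rough, Linux-flavoured
sketch (an assumption-laden illustration, not a proposal): one large shared
mapping, with individual 8kB blocks locked while resident and unlocked on
eviction. Whether the kernel copes with many such partially locked ranges --
on Linux a partial mlock() can split the underlying VMA -- is exactly the
concern raised upthread.

/* Sketch: one large shared mapping, locking individual 8kB blocks.
 * All names and sizes here are illustrative assumptions. */
#include <sys/mman.h>
#include <stdio.h>

#define BLCKSZ   8192
#define NBLOCKS  (128 * 1024)           /* ~1GB of buffer space */

int
main(void)
{
    size_t  len = (size_t) NBLOCKS * BLCKSZ;
    char   *base;

    /* MAP_SHARED | MAP_ANONYMOUS so the region could be inherited by
     * forked backends (Linux-style; portability is part of the problem). */
    base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    /* Lock one block while it is "in the buffer pool"... */
    if (mlock(base + (size_t) 42 * BLCKSZ, BLCKSZ) != 0)
        perror("mlock");

    /* ...and unlock it again when it is evicted.  Doing this for many
     * scattered blocks is what creates the pile of kernel mappings the
     * earlier mail worries about. */
    (void) munlock(base + (size_t) 42 * BLCKSZ, BLCKSZ);

    munmap(base, len);
    return 0;
}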

merlin
