Re: 2nd Level Buffer Cache

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, rsmogura(at)softperience(dot)eu, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-21 10:24:22
Lists: pgsql-hackers

On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...

A lot of people have talked about it. You can find references to mmap
going at least as far back as 2001 or so. The problem is that it would
depend on the OS implementing things in a certain way and guaranteeing
things we don't think can be portably assumed. We would need to mlock
large amounts of address space, which most OSes don't allow, and we
would need to at least mlock and munlock lots of small bits of memory
all over the place, which would create lots and lots of mappings that
the kernel and hardware implementations would generally not appreciate.

> As far as I know, no OS has a more sophisticated approach to eviction
> than LRU.  And clock-sweep is a significant improvement on performance
> over LRU for frequently accessed database objects ... plus our
> optimizations around not overwriting the whole cache for things like VACUUM.

The clock-sweep algorithm was standard OS design before you or I knew
how to type. I would expect any half-decent OS to have something at
least as good -- perhaps better because it can rely on hardware
features to handle things.
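
For readers who haven't seen it, here is a minimal sketch of the
clock-sweep idea in Python. It is a toy model, not PostgreSQL's or any
kernel's actual code: the class name, the `where` lookup table, and the
usage-count cap of 5 are all assumptions made for illustration.

```python
class ClockSweep:
    """Toy clock-sweep buffer replacement (illustrative only)."""

    def __init__(self, nbuffers, max_usage=5):
        self.pages = [None] * nbuffers   # page held by each buffer slot
        self.usage = [0] * nbuffers      # per-slot usage counter
        self.where = {}                  # page -> slot index
        self.hand = 0                    # current clock hand position
        self.max_usage = max_usage       # cap chosen for illustration

    def access(self, page):
        """Return True on a hit; on a miss, load the page and return False."""
        slot = self.where.get(page)
        if slot is not None:
            # Hit: bump the usage counter, saturating at the cap.
            self.usage[slot] = min(self.usage[slot] + 1, self.max_usage)
            return True
        slot = self._find_victim()
        old = self.pages[slot]
        if old is not None:
            del self.where[old]          # evict the previous occupant
        self.pages[slot] = page
        self.usage[slot] = 1
        self.where[page] = slot
        return False

    def _find_victim(self):
        """Advance the hand, decrementing counters, until one reaches zero."""
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % len(self.pages)
            if self.pages[slot] is None or self.usage[slot] == 0:
                return slot
            self.usage[slot] -= 1
```

A frequently touched page survives several passes of the hand because
its counter has to be decremented to zero before the slot can be
reused, which is the property that plain LRU lacks under scan-heavy
load.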

However, the second point is the crux of the issue and of all similar
issues on where to draw the line between the OS and Postgres. The OS
knows more about the hardware characteristics and can better
optimize the overall system behaviour, but Postgres understands its
own access patterns better and can better optimize its own behaviour,
whereas the OS is stuck reverse-engineering what Postgres needs,
usually from simple heuristics.

> 2-level caches work well for a variety of applications.

I think 2-level caches with simple heuristics like "pin all the
indexes" are unlikely to be helpful. At least they won't optimize the
average case, and I think that's been proven. They might be helpful
for optimizing the worst case, which would reduce the standard
deviation. Perhaps we're at the point now where that matters.

Where it might be helpful is as a more refined version of the
"sequential scans use limited set of buffers" patch. Instead of having
each sequential scan use a hard coded number of buffers, perhaps all
sequential scans should share a fraction of the global buffer pool
managed separately from the main pool. Though in my thought
experiments I don't see any real win here. In the current scheme, if
there's any sign that a buffer is useful to someone else, it gets
dropped from the sequential scan's set of buffers to reuse anyway.
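
A rough sketch of that behaviour, as I understand it, might look like
the following. This is not the actual patch: `ScanRing`, `allocate`,
and `is_in_demand` are hypothetical names, and "in demand" stands in
for whatever signal the real code uses to decide that a buffer has
become useful to someone else.

```python
import itertools


class ScanRing:
    """Toy model of a sequential scan reusing a small ring of buffers."""

    def __init__(self, size, allocate):
        self.slots = [None] * size   # buffers this scan is allowed to reuse
        self.next = 0                # next ring position to hand out
        self.allocate = allocate     # hypothetical: fetch a buffer from the main pool

    def get_buffer(self, is_in_demand):
        """Reuse the next ring buffer unless someone else wants it."""
        i = self.next
        self.next = (self.next + 1) % len(self.slots)
        buf = self.slots[i]
        if buf is None or is_in_demand(buf):
            # The old buffer (if any) stays in the main pool for its other
            # users; the ring simply forgets it and takes a replacement.
            buf = self.allocate()
            self.slots[i] = buf
        return buf


# Usage: buffer ids come from a counter; `hot` marks buffers others touched.
fresh = itertools.count()
hot = set()
ring = ScanRing(2, lambda: next(fresh))
first_three = [ring.get_buffer(hot.__contains__) for _ in range(3)]
hot.add(1)                                   # another backend starts using buffer 1
replacement = ring.get_buffer(hot.__contains__)
```

Once another backend shows interest in buffer 1, the ring gives it up
rather than overwriting it, so a separate shared pool for scans would
have little left to protect.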

> Now, what would be *really* useful is some way to avoid all the data
> copying we do between shared_buffers and the FS cache.

Well, the two options are mmap/mlock or direct I/O. The former might
be a fun experiment, but I expect any OS to fall over pretty quickly
when faced with thousands (or millions) of 8kB mappings. The latter
would need Postgres to do async I/O and, ideally, to have a global
view of its I/O access patterns so it could do prefetching in a lot
more cases.
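
To put rough numbers on "thousands (or millions)", here is a
back-of-the-envelope calculation. It assumes the worst case, where
every 8kB page ends up as its own mapping; the function name and the
pool sizes are made up for illustration.

```python
PAGE_SIZE = 8 * 1024  # 8kB pages, as in the text above


def worst_case_mappings(pool_bytes, page_size=PAGE_SIZE):
    """Number of mappings if every page in the pool is mapped separately."""
    return pool_bytes // page_size


GIB = 1024 ** 3
for gib in (1, 8, 64):
    print(f"{gib:3d} GiB pool: {worst_case_mappings(gib * GIB):,} mappings")
```

Even a 1 GiB pool is already past a hundred thousand mappings in this
worst case, and an 8 GiB pool is past a million.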

