> >> Yeah, but isn't that theory a hangover from pre-Unix operating systems?
> > Informix, oracle, etc all do raw device access bypassing the kernels
> > buffering, etc. So they need heaping gobules of memory to do the same
> > thing the kernel does..
> D'oh, I believe Jeff's put his finger on it. You need lotsa RAM if you
> are trying to bypass the OS. But Postgres would like to work with the
> OS, not bypass it.
> > but since they know the exact patterns of data and
> > how things will be done they can fine tune their buffer caches to get much
> > better performance than the kernel (15-20% in informix's case) since the
> > kernel needs to be a "works good generally"
> They go to all that work for 15-20% ??? Remind me not to follow that
> primrose path. I can think of lots of places where we can buy 20% for
> less work than implementing (and maintaining) our own raw-device access
This is related somewhat to the raw device access discussion. This is
a quote from Matt Dillion (FreeBSD VM guru) on the topic of disk
<200301270657(dot)h0R6v2qH071774(at)apollo(dot)backplane(dot)com>) and a few bits at
### Begin quote
Mmmmm. Basically what it comes down to is that without foreknowledge
of the data locations being accessed, it is not possible for any
cache algorithm to adapt to all the myriad ways data might be accessed.
If you focus the cache on one methodology it will probably perform
terribly when presented with some other methodology.
What this means is that for the cases where you *KNOW* how a program
intends to access a dataset larger then main memory, it is better
to have the program explicitly cache/not-cache the data under program
control rather then trying to force the system cache to adapt.
I'll also be nice and decode some of Terry's Jargon for the rest of
:will result in significant failure of random page replacement to
:result in cache hits; likewise, going to 85% overage will practically
:guarantee an almost 100% failure rate, as cyclical access with random
:replacement is statistically more likely, in aggregate, to replace
:the pages which are there longer (the probability is iterative and
:additive: it's effectively a permutation).
What Terry is saying is that if you have a dataset that is 2x
the size of your cache, the cache hit rate on that data with random
page replacement is NOT going to be 50%. This is because with random
page replacement the likelihood of a piece of data being found in
the cache depends on how long the data has been sitting in the cache.
The longer the data has been sitting in the cache, the less likely you
will find it when you need it (because it is more likely to have been
replaced by the random replacement algorithm over time).
So, again, the best caching methodology to use in the case where
you *know* how much data you will be accessing and how you will access
it is to build the caching directly into your program and not depend
on system caching at all (for datasets which are far larger then
This is true of all systems, not just BSD. This is one reason why
databases do their own caching (another is so databases know when an
access will require I/O for scheduling reasons, but that's a different
The FreeBSD VM caching system does prevent one process from exhausting
another process's cached data due to a sequential access, but the
FreeBSD VM cache does not try to outsmart sequential data accesses to
datasets which are larger then available cache space because it's an
insanely difficult (impossible) problem to solve properly without
foreknowledge of what data elements will be accessed when.
This isn't to say that we can't improve on what we have now.
I certainly think we can. But there is no magic bullet that will
deal with every situation.
### End quote
So if there really is only a 15-20% performance gain to be had from
using raw disk access, that 15-20% loss comes from not being able to
tell the OS what to cache, what not to cache, and what order to have
the pages in... which only really matters if there is RAM available to
the kernel to cache, and that it is able to determine what is valuable
to cache in the course of its operations. Predictive caching by the
OS isn't done because it understands PostgreSQL, because it
understands a generic algorithm for page hits/misses.
What is interesting after reading this, however, is the prospect of a
15-20% speed up on certain tables that we know are accessed frequently
by implicitly specifying a set of data to be preferred in a user space
cache. It's impossible for the OS to cache the pages that make the
biggest impact on user visible performance given the OS has no
understanding of what pages make a big difference on user visible
performance, a user land database process, however, would.
As things stand, it's entirely possible for a set of large queries to
come through and wipe the kernel's cache that smaller queries were
using. Once a cache misses, the kernel then has to fetch the data
again which could slow down over all number of transactions per
second. That said, this is something that an in-database scheduler
could avoid by placing a lower priority on larger, more complex
queries with the assumption being that having the smaller queries
continue to process and get in/out is more important than shaving a
few seconds off of a larger query that would deplete the cache used by
the smaller queries. Oh to be a DBA and being able to make those
decisions instead of the kernel...
Hrm, so two ideas or questions come to mind:
1) On some of my really large read only queries, it would be SUUUPER
nice to be able to re-nice the process from SQL land to 5, 10, or
even 20. IIRC, BSD's VM system is smart enough to prevent lower
priority jobs from monopolizing the disk cache, which would let the
smaller faster turn around queries, continue to exist with their
data in the kernel's disk cache. (some kind of query complexity
threshold that results in a reduction of priority or an explicit
directive to run at a lower priority)
2) Is there any way of specifying that a particular set of tables
should be kept in RAM or some kind of write through cache? I know
data is selected into a backend out of the catalogs, but would it
be possible to have them kept in memory and only re-read on change
with some kind of semaphore? Now that all system tables are in
their own schemas (pg_catalog and pg_toast), would it be hard to
set a flag on a change to those tables that would cause the
postmaster, or children, to re-read then instead of rely on their
cache? With copy-on-write forking, this could be pretty efficient
if the postmaster did this and forked off a copy with the tables
already in memory instead of on disk.
Just a few ideas/ramblings, hope someone finds them interesting... the
renice function is one that I think I'll spend some time looking into
here shortly actually. -sc
In response to
pgsql-performance by date
|Next:||From: Ron Johnson||Date: 2003-01-27 09:08:20|
|Subject: Re: LOCK TABLE & speeding up mass data loads|
|Previous:||From: Bruce Momjian||Date: 2003-01-27 00:54:25|
|Subject: Re: [PERFORM] optimizing query|