Re: Linux max on shared buffers?

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Curt Sampson <cjs(at)cynic(dot)net>, GB Clark <postgres(at)vsservices(dot)com>, glenebob(at)nwlink(dot)com, pgsql-general(at)postgresql(dot)org
Subject: Re: Linux max on shared buffers?
Date: 2002-07-20 07:37:14
Message-ID: 20020720173714.A17364@svana.org
Lists: pgsql-general

On Fri, Jul 19, 2002 at 10:41:28AM -0400, Jan Wieck wrote:
> Curt Sampson wrote:
> >
> > On Wed, 17 Jul 2002, GB Clark wrote:
> >
> > > Not all platforms have mmap. This has been discussed before, I believe.
> >
> > I've now heard several times here that not all platforms have mmap
> > and/or mmap is not compatible across all platforms. I've yet to
> > see any solid evidence of this, however, and I'm inclined to believe
> > that mmap compatibility is no worse than compatibility with the
> > system V shared memory we're using already, since both are fairly
> > specifically defined by POSIX.
>
> Curt,
>
> I still don't completely understand what you are proposing. What I
> understood so far is that you want to avoid double buffering (OS buffer
> plus SHMEM). Wouldn't that require that the access to a block in the
> file (table, index, sequence, ...) has to go directly through a mmapped
> region of that file?

Well, you would have to deal with the fact that writing changes through an
mmap() mapping is allowed, but you have no guarantee of when they will
finally be written to disk. Given WAL, I would suggest using mmap() for
reading only and using write() to update the file.
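
As a rough illustration of that split (not PostgreSQL code), here is a minimal
sketch: blocks are read through a read-only MAP_SHARED mapping and updated with
pwrite() plus fsync(). The names map_readonly(), write_block() and
demo_read_mmap_write_pwrite(), the temp file path, and the 8K BLOCK_SIZE are
assumptions made for the example. It also assumes a unified buffer cache (as on
Linux), so that data written with pwrite() shows up in the mapping.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192   /* assumed block size (the 8K blocks in this thread) */

/* Map len bytes of fd read-only; stores through this mapping are impossible. */
static const unsigned char *map_readonly(int fd, off_t offset, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);
    return p == MAP_FAILED ? NULL : (const unsigned char *)p;
}

/* Update one block with pwrite(), then fsync() so we know when it is on disk. */
static int write_block(int fd, size_t blockno, const unsigned char *buf)
{
    assert(buf != NULL);
    off_t off = (off_t)blockno * BLOCK_SIZE;
    if (pwrite(fd, buf, BLOCK_SIZE, off) != (ssize_t)BLOCK_SIZE)
        return -1;
    return fsync(fd);
}

/* Hypothetical demo: returns 0 if the write is visible through the mapping. */
static int demo_read_mmap_write_pwrite(void)
{
    char path[] = "/tmp/mmapdemoXXXXXX";
    unsigned char block[BLOCK_SIZE];
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                               /* file vanishes on close */

    if (ftruncate(fd, 2 * BLOCK_SIZE) != 0) {   /* two zero-filled blocks */
        close(fd);
        return -1;
    }
    const unsigned char *view = map_readonly(fd, 0, 2 * BLOCK_SIZE);
    if (view == NULL) {
        close(fd);
        return -1;
    }

    memset(block, 0xAB, sizeof block);
    int rc = write_block(fd, 1, block);         /* overwrite block 1 only */

    /* Block 0 is untouched; block 1 now reads back through the mapping. */
    int ok = (rc == 0 && view[0] == 0 && view[BLOCK_SIZE] == 0xAB);

    munmap((void *)view, 2 * BLOCK_SIZE);
    close(fd);
    return ok ? 0 : -1;
}
```

The point of the design is that the only path that modifies the file is an
explicit system call, so its ordering relative to the WAL stays under the
program's control.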

> Let's create a little test case to discuss. I have two tables, 2
> Gigabyte in size each (making 4 segments of 1 GB total) plus a 512 MB
> index for each. Now I join them in a query, that results in a nestloop
> doing index scans.
>
> On a 32 bit system you cannot mmap both tables plus the indexes at the
> same time completely. But the access of the execution plan is reading
> one table's index, fetching the heap tuples from it by random access, and
> inside of that loop doing the same for the second table. So chances are,
> that this plan randomly peeks around in the entire 5 Gigabyte, at least
> you cannot predict which blocks it will need.

Correct.

> So far so good. Now what do you map when? Can you map multiple
> noncontiguous 8K blocks out of each file? If so, how do you coordinate
> that all backends in summary use at maximum the number of blocks you
> want PostgreSQL to use (each unique block counts, regardless of how many
> backends have it mmap()'d, right?). And if a backend needs a block and
> the max is reached already, how does it tell the other backends to unmap
> something?

You can mmap() any portion anywhere (except on PA-RISC, as Tom pointed
out). I was thinking in 8MB lots to avoid doing too many system calls (also,
on i386, this chunk size could save the kernel from making many page tables).
You don't need any coordination between backends over memory usage. The
mmap() is merely a window into the kernel's disk cache. You are not currently
limiting the disk cache of the kernel, nor would it be sensible to do so.

If you need a block, you simply dereference the appropriate pointer (after
checking you have mmap()ed it in). If the data is in memory, the dereference
succeeds. If it's not, you get a page fault, the data is fetched and the
dereference succeeds on the second try. If in that process the kernel needed
to throw out another page, who cares? If another backend needs that page
it'll get read back in.
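
To make the "check it is mapped, then dereference" step concrete, here is a
minimal sketch of a per-backend table of 8MB windows. Everything named here
(struct window_table, get_block(), window_demo(), MAX_WINDOWS, the temp file)
is hypothetical and only for illustration; it handles a single file and never
unmaps a window while in use, which a real implementation would need to do.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE        8192u                 /* 8K blocks */
#define WINDOW_SIZE       (8u * 1024u * 1024u)  /* 8MB lots */
#define BLOCKS_PER_WINDOW (WINDOW_SIZE / BLOCK_SIZE)
#define MAX_WINDOWS       64                    /* hypothetical limit */

/* Hypothetical per-backend table: one slot per 8MB window of one file. */
struct window_table {
    int fd;
    const unsigned char *base[MAX_WINDOWS];     /* NULL = not mapped yet */
};

static void window_table_init(struct window_table *t, int fd)
{
    assert(WINDOW_SIZE % BLOCK_SIZE == 0);
    t->fd = fd;
    for (int i = 0; i < MAX_WINDOWS; i++)
        t->base[i] = NULL;
}

/*
 * Return a read-only pointer to block blockno, mapping its window on first
 * use. The caller must keep blockno inside the file: touching pages past
 * end-of-file raises SIGBUS.
 */
static const unsigned char *get_block(struct window_table *t, size_t blockno)
{
    size_t win = blockno / BLOCKS_PER_WINDOW;
    size_t idx = blockno % BLOCKS_PER_WINDOW;

    if (win >= MAX_WINDOWS)
        return NULL;
    if (t->base[win] == NULL) {                 /* one mmap() per window */
        void *p = mmap(NULL, WINDOW_SIZE, PROT_READ, MAP_SHARED,
                       t->fd, (off_t)win * WINDOW_SIZE);
        if (p == MAP_FAILED)
            return NULL;
        t->base[win] = p;
    }
    /* Plain pointer arithmetic; any page fault is handled by the kernel. */
    return t->base[win] + idx * BLOCK_SIZE;
}

/* Hypothetical demo: returns 0 if block 2 reads back through the window. */
static int window_demo(void)
{
    char path[] = "/tmp/mmapwinXXXXXX";
    unsigned char block[BLOCK_SIZE];
    struct window_table t;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    memset(block, 0x5A, sizeof block);
    if (pwrite(fd, block, sizeof block, (off_t)2 * BLOCK_SIZE)
            != (ssize_t)sizeof block) {         /* file is now 3 blocks long */
        close(fd);
        return -1;
    }

    window_table_init(&t, fd);
    const unsigned char *p = get_block(&t, 2);
    int ok = (p != NULL && p[0] == 0x5A && p[BLOCK_SIZE - 1] == 0x5A);

    if (t.base[0] != NULL)
        munmap((void *)t.base[0], WINDOW_SIZE);
    close(fd);
    return ok ? 0 : -1;
}
```

Note that the table is private to each backend. Nothing in it limits how much
of the file the kernel keeps cached; it only records which address ranges this
process currently has mapped.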

One case where this would be useful would be an i386 machine with 64GB of
memory. Then you are in effect simply mapping different parts of the cache
at different times. No blocks are copied *ever*.

> I assume I am missing something very important here, or I am far off
> with my theory and the solution looks totally different. So could you
> please tell me how this is going to work?

It is different. I believe you would still need some form of shared memory
to coordinate write()s.

--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.
