Re: 2nd Level Buffer Cache

From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-23 15:52:25
Message-ID: 201103231652.25488.rsmogura@softperience.eu
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> Tuesday 22 March 2011 23:06:02
> On Tue, Mar 22, 2011 at 4:28 PM, Radosław Smogura
>
> <rsmogura(at)softperience(dot)eu> wrote:
> > Merlin Moncure <mmoncure(at)gmail(dot)com> Monday 21 March 2011 20:58:16
> >
> >> On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> >> > On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
> >
> > wrote:
> >> >> Can't you make just one large mapping and lock it in 8k regions? I
> >> >> thought the problem with mmap was not being able to detect other
> >> >> processes
> >> >> (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.html)
> >> >> compatibility issues (possibly obsolete), etc.
> >> >
> >> > I was assuming that locking part of a mapping would force the kernel
> >> > to split the mapping. It has to record the locked state somewhere so
> >> > it needs a data structure that represents the size of the locked
> >> > section and that would, I assume, be the mapping.
> >> >
> >> > It's possible the kernel would not in fact fall over too badly doing
> >> > this. At some point I'll go ahead and do experiments on it. It's a bit
> >> > fraught though, as the performance may depend on the memory
> >> > management features of the chipset.
> >> >
> >> > That said, that's only part of the battle. On 32bit you can't map the
> >> > whole database as your database could easily be larger than your
> >> > address space. I have some ideas on how to tackle that but the
> >> > simplest test would be to just mmap 8kB chunks everywhere.
> >>
> >> Even on 64-bit systems you only have a 48-bit address space, which is not
> >> a theoretical limitation. However, at least on linux you can map in
> >> and map out pretty quick (10 microseconds paired on my linux vm) so
> >> that's not so big of a deal. Dealing with rapidly growing files is a
> >> problem. That said, probably you are not going to want to reserve
> >> multiple gigabytes in 8k non-contiguous chunks.
> >>
> >> > But it's worse than that. Since you're not responsible for flushing
> >> > blocks to disk any longer you need some way to *unlock* a block when
> >> > it's possible to be flushed. That means when you flush the xlog you
> >> > have to somehow find all the blocks that might no longer need to be
> >> > locked and atomically unlock them. That would require new
> >> > infrastructure we don't have though it might not be too hard.
> >> >
> >> > What would be nice is a mlock_until() where you eventually issue a
> >> > call to tell the kernel what point in time you've reached and it
> >> > unlocks everything older than that time.
> >>
> >> I wonder if there is any reason to mlock at all...if you are going to
> >> 'do' mmap, can't you just hide under current lock architecture for
> >> actual locking and do direct memory access without mlock?
> >>
> >> merlin
> >
> > Actually, after dealing with mmap and adding munmap, I found a crucial
> > reason not to use mmap:
> > You need to munmap, and for me this takes a lot of time, even if I read
> > with MAP_SHARED | PROT_READ. It looks like Linux does a flush or
> > something else; the same happens with MAP_FIXED, MAP_PRIVATE, etc.
>
> can you produce a small program demonstrating the problem? This is not
> how things should work AIUI.
>
> I was thinking about playing with an mmap implementation of the clog
> system -- it's perhaps a better fit. clog has a rigidly defined size and
> very high performance requirements. It's also much less change than
> reimplementing heap buffering, and maybe not so much affected by
> munmap.
>
> merlin

Ah... just one thing that may be useful in explaining why performance is lost
with huge memory: I saw that mmapped buffers are allocated at addresses
starting with something like 0x007, so definitely above 4 GB.
