Re: 2nd Level Buffer Cache

From: Radosław Smogura <rsmogura(at)softperience(dot)eu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 2nd Level Buffer Cache
Date: 2011-03-23 15:50:06
Message-ID: 201103231650.07007.rsmogura@softperience.eu
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> Monday 21 March 2011 20:58:16
> On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> >> Can't you make just one large mapping and lock it in 8k regions? I
> >> thought the problem with mmap was not being able to detect other
> >> processes
> >> (http://www.mail-archive.com/pgsql-general(at)postgresql(dot)org/msg122301.html)
> >> compatibility issues (possibly obsolete), etc.
> >
> > I was assuming that locking part of a mapping would force the kernel
> > to split the mapping. It has to record the locked state somewhere so
> > it needs a data structure that represents the size of the locked
> > section and that would, I assume, be the mapping.
> >
> > It's possible the kernel would not in fact fall over too badly doing
> > this. At some point I'll go ahead and do experiments on it. It's a bit
> > fraught though as the performance may depend on the memory
> > management features of the chipset.
> >
> > That said, that's only part of the battle. On 32bit you can't map the
> > whole database as your database could easily be larger than your
> > address space. I have some ideas on how to tackle that but the
> > simplest test would be to just mmap 8kB chunks everywhere.
>
> Even on 64 bit systems you only have 48 bit address space which is not
> a theoretical limitation. However, at least on linux you can map in
> and map out pretty quick (10 microseconds paired on my linux vm) so
> that's not so big of a deal. Dealing with rapidly growing files is a
> problem. That said, probably you are not going to want to reserve
> multiple gigabytes in 8k non contiguous chunks.
>
> > But it's worse than that. Since you're not responsible for flushing
> > blocks to disk any longer you need some way to *unlock* a block when
> > it's possible to be flushed. That means when you flush the xlog you
> > have to somehow find all the blocks that might no longer need to be
> > locked and atomically unlock them. That would require new
> > infrastructure we don't have though it might not be too hard.
> >
> > What would be nice is a mlock_until() where you eventually issue a
> > call to tell the kernel what point in time you've reached and it
> > unlocks everything older than that time.
>
> I wonder if there is any reason to mlock at all...if you are going to
> 'do' mmap, can't you just hide under current lock architecture for
> actual locking and do direct memory access without mlock?
>
> merlin
I can't reproduce this. A simple test shows reads are 2x faster with mmap than
with read().
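
A minimal sketch of the kind of comparison meant here, not the test that was actually run: both functions walk the same file, one copying 8kB blocks through pread(), the other reading straight from a mapping. The names sum_with_read and sum_with_mmap are hypothetical, and BLCKSZ is assumed to be PostgreSQL's default 8kB. To compare speed, each call could be wrapped with clock_gettime(CLOCK_MONOTONIC, ...); results will depend heavily on whether the file is already in the OS page cache.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192 /* assumed: PostgreSQL's default block size */

/* Sum every byte of the file by copying it block by block with pread(). */
int sum_with_read(const char *path, uint64_t *sum)
{
    unsigned char buf[BLCKSZ];
    ssize_t n;
    off_t off = 0;
    int fd;

    assert(sum != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    *sum = 0;
    while ((n = pread(fd, buf, sizeof buf, off)) > 0) {
        for (ssize_t i = 0; i < n; i++)
            *sum += buf[i];
        off += n;
    }
    close(fd);
    return n < 0 ? -1 : 0;
}

/* Sum every byte of the file through a read-only shared mapping (no copy). */
int sum_with_mmap(const char *path, uint64_t *sum)
{
    struct stat st;
    unsigned char *p;
    int fd;

    assert(sum != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    *sum = 0;
    if (st.st_size == 0) { /* mmap() rejects a zero-length mapping */
        close(fd);
        return 0;
    }
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;
    for (off_t i = 0; i < st.st_size; i++)
        *sum += p[i];
    munmap(p, (size_t) st.st_size);
    return 0;
}
```

Both functions return 0 on success and must produce the same sum for the same file, which makes it easy to check that the two paths really read identical data before timing them.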

I'm attaching what I have done with mmap (really ugly, but I'm in the forest).
It is a read-only solution, so initialize the database first with some amount
of data (I have about 300MB); the 2nd level scripts may do this for you.

This is what I found:
1. If I do not require the new mapping to be placed in the previous region
(mmap with MAP_FIXED), and just do munmap / mmap with each query, execution
time grows by about 10%.

2. Sometimes it is enough just to comment or uncomment something that has no
side effects on code flow (in bufmgr.c: (un)comment some unused if; put NULL,
it will be replaced), and e.g. query execution time may grow 2x.

3. My initial solution was 2% faster, about 9ms when reading; now, after
making it more usable, it is 10% slower.
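
To make point 1 concrete, here is a rough sketch of the two remapping strategies it contrasts. This is an illustration under stated assumptions, not code from the attached patch: remap_fixed, remap_fresh and blcksz_is_mappable are hypothetical names, BLCKSZ is assumed to be 8kB, and the file offset passed to mmap() must be a multiple of the system page size, which holds for 8kB blocks on 4kB-page systems but not on every platform.

```c
#include <assert.h>
#include <fcntl.h>    /* open() flags for the caller */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192 /* assumed: PostgreSQL's default block size */

/* mmap() offsets must be page-aligned; check that 8kB blocks qualify. */
int blcksz_is_mappable(void)
{
    long page = sysconf(_SC_PAGESIZE);
    return page > 0 && BLCKSZ % page == 0;
}

/*
 * Strategy A: map another block over the existing region with MAP_FIXED.
 * The old mapping at addr is replaced atomically and the address is reused.
 */
void *remap_fixed(void *addr, int fd, off_t block)
{
    assert(addr != NULL && addr != MAP_FAILED);
    return mmap(addr, BLCKSZ, PROT_READ, MAP_SHARED | MAP_FIXED,
                fd, block * BLCKSZ);
}

/*
 * Strategy B: drop the old mapping (if any) and let the kernel choose
 * where the new one goes. The returned address may differ from addr.
 */
void *remap_fresh(void *addr, int fd, off_t block)
{
    if (addr != NULL && addr != MAP_FAILED && munmap(addr, BLCKSZ) != 0)
        return MAP_FAILED;
    return mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED, fd, block * BLCKSZ);
}
```

Both functions return MAP_FAILED on error. Strategy A keeps the buffer at a stable address with a single system call, while strategy B costs two calls and lets the address move.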

Regards,
Radek

Attachment Content-Type Size
pg_mmap_20110323.patch.bz2 application/x-bzip 13.3 KB
