Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: James Bottomley <James(dot)Bottomley(at)HansenPartnership(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Trond Myklebust <trondmy(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dave Chinner <david(at)fromorbit(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 20:00:29
Message-ID: 1389729629.2192.59.camel@dabdike.int.hansenpartnership.com
Lists: pgsql-hackers

On Tue, 2014-01-14 at 12:39 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
> <James(dot)Bottomley(at)hansenpartnership(dot)com> wrote:
> > On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
> >> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> > In terms of avoiding double-buffering, here's my thought after reading
> >> > what's been written so far. Suppose we read a page into our buffer
> >> > pool. While the page is clean, it would be ideal for the mapping to
> >> > be shared between the buffer cache and our pool, sort of like
> >> > copy-on-write. That way, if we decide to evict the page, it will
> >> > still be in the OS cache if we end up needing it again (remember, the
> >> > OS cache is typically much larger than our buffer pool). But if the
> >> > page is dirtied, then instead of copying it, just have the buffer pool
> >> > forget about it, because at that point we know we're going to write
> >> > the page back out anyway before evicting it.
> >> >
> >> > This would be pretty similar to copy-on-write, except without the
> >> > copying. It would just be forget-from-the-buffer-pool-on-write.
> >>
> >> But... either copy-on-write or forget-on-write needs a page fault, and
> >> thus a page mapping.
> >>
> >> Is a page fault more expensive than copying 8k?
> >>
> >> (I really don't know).
> >
> > A page fault can be expensive, yes ... but perhaps you don't need one.
> >
> > What you want is a range of memory that's read from a file but treated
> > as anonymous for writeout (i.e. written to swap if we need to reclaim
> > it). Then at some time later, you want to designate it as written back
> > to the file instead so you control the writeout order. I'm not sure we
> > can do this: the separation between file backed and anonymous pages is
> > pretty deeply ingrained into the OS, but if it were possible, is that
> > what you want?
>
> Doesn't sound exactly like what I had in mind. What I was suggesting
> is an analogue of read() that, if it reads full pages of data to a
> page-aligned address, shares the data with the buffer cache until it's
> first written instead of actually copying the data.

The only way to make this happen is to mmap() the file into the buffer
and call madvise() with MADV_WILLNEED.

> The pages are
> write-protected so that an attempt to write the address range causes a
> page fault. In response to such a fault, the pages become anonymous
> memory and the buffer cache no longer holds a reference to the page.

OK, so here I thought of another madvise() call to switch the region to
anonymous memory. A page fault works too, of course; it's just that
taking one fault per page in the mapping will be expensive.
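
For comparison, the closest behaviour available today is a MAP_PRIVATE
mapping: the first write to each page takes a fault, and the kernel gives
the process a private anonymous copy of that page. Unlike the proposal, it
does copy the page, and the page cache keeps its own version. A minimal
sketch, assuming POSIX/Linux; the helper name map_private_cow() is
hypothetical:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map the first len bytes of a file privately.
 * Reads are served from the page cache; the first write to a page
 * faults and replaces it with a private anonymous copy. */
static char *map_private_cow(const char *path, size_t len)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    /* PROT_WRITE is allowed on a read-only descriptor because
     * MAP_PRIVATE changes are never written back to the file. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);

    return p == MAP_FAILED ? NULL : (char *)p;
}
```

Writes through the returned pointer stay in the process and never reach
the file, so writeback would still have to be done explicitly, for
example with pwrite().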

Do you care about handling aliases ... what happens if someone else
reads from the file, or will that never occur? The reason for asking is
that it's much easier if someone else mmapping the file gets your
anonymous memory than if we create an alias in the page cache.

James
