Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Trond Myklebust <trondmy(at)gmail(dot)com>, Bottomley James <James(dot)Bottomley(at)HansenPartnership(dot)com>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Andres Freund <andres(at)2ndQuadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dave Chinner <david(at)fromorbit(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 17:03:29
Message-ID: 52D56DE1.6070009@vmware.com
Lists: pgsql-hackers

On 01/14/2014 06:08 PM, Tom Lane wrote:
> Trond Myklebust <trondmy(at)gmail(dot)com> writes:
>> On Jan 14, 2014, at 10:39, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> "Don't be aggressive" isn't good enough. The prohibition on early write
>>> has to be absolute, because writing a dirty page before we've done
>>> whatever else we need to do results in a corrupt database. It has to
>>> be treated like a write barrier.
>
>> Then why are you dirtying the page at all? It makes no sense to tell the
>> kernel “we’re changing this page in the page cache, but we don’t want you
>> to change it on disk”: that’s not consistent with the function of a page
>> cache.
>
> As things currently stand, we dirty the page in our internal buffers,
> and we don't write it to the kernel until we've written and fsync'd the
> WAL data that needs to get to disk first. The discussion here is about
> whether we could somehow avoid double-buffering between our internal
> buffers and the kernel page cache.
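
The ordering described above (WAL written and fsync'd first, data page handed to the kernel only afterwards) can be sketched roughly as follows. This is a minimal illustration, not PostgreSQL code: the helper name `flush_page_wal_first`, the file names, and the 8 kB page size used in the demo are all assumptions made for the example.

```python
import os
import tempfile


def flush_page_wal_first(wal_fd, data_fd, wal_record, page, page_offset):
    """Hypothetical helper: make the WAL record durable, then write the page."""
    # Step 1: the WAL record describing the change must reach disk first.
    os.write(wal_fd, wal_record)
    os.fsync(wal_fd)
    # Step 2: only now is the dirty page handed to the kernel page cache,
    # which is free to write it back to disk at any time from here on.
    os.pwrite(data_fd, page, page_offset)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wal_fd = os.open(os.path.join(tmp, "wal"),
                         os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        data_fd = os.open(os.path.join(tmp, "data"),
                          os.O_RDWR | os.O_CREAT, 0o600)
        try:
            # Write one 8 kB page at page number 1 (offset 8192).
            flush_page_wal_first(wal_fd, data_fd, b"change-record",
                                 b"\x00" * 8192, 8192)
        finally:
            os.close(wal_fd)
            os.close(data_fd)
```

The point of the sketch is that the application, not the kernel, decides when step 2 may happen. Once the page sits in the kernel page cache, there is no portable way to say "do not write this back yet".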

To be honest, I think the impact of double buffering in real-life
applications is greatly exaggerated. If you follow the usual guideline
and configure shared_buffers to 25% of available RAM, at worst you're
wasting 25% of RAM to double buffering. That's significant, but it's not
the end of the world, and it's a problem that can be compensated for by
simply buying more RAM.
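
As a rough worked example of that worst case, here is a small sketch. It assumes, purely for illustration, that all RAM is split between shared_buffers and the OS page cache and that every page in shared_buffers is also held in the OS cache; the function name and the 64 GB figure are made up for the example.

```python
def double_buffering_worst_case(ram_gb, shared_buffers_fraction=0.25):
    """Return (duplicated_gb, distinct_cached_gb) for the worst case."""
    shared_buffers = ram_gb * shared_buffers_fraction
    # Simplifying assumption: everything else is OS page cache.
    os_cache = ram_gb - shared_buffers
    # Worst case: every page in shared_buffers is also in the OS cache.
    duplicated = min(shared_buffers, os_cache)
    # Memory that holds distinct data once the duplicates are discounted.
    distinct_cached = shared_buffers + os_cache - duplicated
    return duplicated, distinct_cached


duplicated, distinct = double_buffering_worst_case(64)
print(f"duplicated: {duplicated} GB, distinct cached data: {distinct} GB")
```

With 64 GB of RAM and the 25% guideline, at most 16 GB is duplicated, leaving 48 GB of distinct cached data. In practice the overlap is usually smaller than this bound.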

Of course, if someone can come up with an easy way to solve that, that'd
be great, but if it means giving up other advantages that we get from
relying on the OS page cache, then -1 from me. The usual response to
"why don't you just use O_DIRECT?" is that it'd require reimplementing a
lot of I/O infrastructure, but that misses an IMHO more important point: it
would require setting shared_buffers a lot higher to get the same level
of performance you get today. That has a number of problems:

1. It becomes a lot more important to tune shared_buffers correctly. Set
it too low, and you're not taking advantage of all the RAM available.
Set it too high, and you'll start swapping, totally killing performance.
I can already hear consultants rubbing their hands, waiting for the rush
of customers that will need expert help to determine the optimal
shared_buffers setting.

2. Memory spent on the buffer cache can't be used for other things. For
example, an index build can temporarily allocate several gigabytes of
memory; if that memory is allocated to the shared buffer cache, it can't
be used for that purpose. Yeah, we could change that, and allow
borrowing pages from the shared buffer cache for other purposes, but
that means more work and more code.

3. Memory used for the shared buffer cache can't be used by other
processes (without swapping). It becomes a lot harder to be a good
citizen on a system that's not entirely dedicated to PostgreSQL.

So not only would we need to re-implement I/O infrastructure, we'd also
need to make memory management a lot smarter and a lot more flexible.
We'd need a lot more information on what else is running on the system
and how badly it needs memory.

> I personally think there is no chance of using mmap for that; the
> semantics of mmap are pretty much dictated by POSIX and they don't work
> for this.

Agreed. It would be possible to use mmap() for pages that are not
modified, though. When you're not modifying, you could mmap() the data
you need, and bypass the PostgreSQL buffer cache that way. The
interaction with the buffer cache becomes complicated, because you
couldn't use the buffer cache's locks etc., and some pages might have a
newer version in the buffer cache than on disk, but it might be doable.
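
A minimal sketch of the read-only idea, using Python's `mmap` module rather than anything from the PostgreSQL source. The helper name `read_page_via_mmap` and the demo file are hypothetical, and the 8192-byte page size is assumed to match PostgreSQL's default block size.

```python
import mmap
import os
import tempfile

PAGE_SIZE = 8192  # assumed: PostgreSQL's default block size


def read_page_via_mmap(path, page_no):
    """Hypothetical helper: copy one page out of a read-only file mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping
        # read-only, so nothing written here can ever dirty a page.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            start = page_no * PAGE_SIZE
            return m[start:start + PAGE_SIZE]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "relation")
        with open(path, "wb") as f:
            f.write(b"a" * PAGE_SIZE + b"b" * PAGE_SIZE)
        page = read_page_via_mmap(path, 1)
        print(len(page), page[:4])
```

Note that a mapping like this only ever sees what the kernel has for the file. It knows nothing about a newer copy of the same page sitting dirty in the application's own buffer cache, which is exactly the complication mentioned above.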

- Heikki
