Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: James Bottomley <James(dot)Bottomley(at)hansenpartnership(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dave Chinner <david(at)fromorbit(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Trond Myklebust <trondmy(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 18:43:03
Message-ID: CAGTBQpYaO345De38yh-LCkORgL8gdmhq+acGOd4PTBsMCJ2szQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 14, 2014 at 2:39 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jan 14, 2014 at 12:20 PM, James Bottomley
> <James(dot)Bottomley(at)hansenpartnership(dot)com> wrote:
>> On Tue, 2014-01-14 at 15:15 -0200, Claudio Freire wrote:
>>> On Tue, Jan 14, 2014 at 2:12 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> > In terms of avoiding double-buffering, here's my thought after reading
>>> > what's been written so far. Suppose we read a page into our buffer
>>> > pool. Until the page is clean, it would be ideal for the mapping to
>>> > be shared between the buffer cache and our pool, sort of like
>>> > copy-on-write. That way, if we decide to evict the page, it will
>>> > still be in the OS cache if we end up needing it again (remember, the
>>> > OS cache is typically much larger than our buffer pool). But if the
>>> > page is dirtied, then instead of copying it, just have the buffer pool
>>> > forget about it, because at that point we know we're going to write
>>> > the page back out anyway before evicting it.
>>> >
>>> > This would be pretty similar to copy-on-write, except without the
>>> > copying. It would just be forget-from-the-buffer-pool-on-write.
>>>
>>> But... either copy-on-write or forget-on-write needs a page fault, and
>>> thus a page mapping.
>>>
>>> Is a page fault more expensive than copying 8k?
>>>
>>> (I really don't know).
>>
>> A page fault can be expensive, yes ... but perhaps you don't need one.
>>
>> What you want is a range of memory that's read from a file but treated
>> as anonymous for writeout (i.e. written to swap if we need to reclaim
>> it). Then at some time later, you want to designate it as written back
>> to the file instead so you control the writeout order. I'm not sure we
>> can do this: the separation between file backed and anonymous pages is
>> pretty deeply ingrained into the OS, but if it were possible, is that
>> what you want?
>
> Doesn't sound exactly like what I had in mind. What I was suggesting
> is an analogue of read() that, if it reads full pages of data to a
> page-aligned address, shares the data with the buffer cache until it's
> first written instead of actually copying the data. The pages are
> write-protected so that an attempt to write the address range causes a
> page fault. In response to such a fault, the pages become anonymous
> memory and the buffer cache no longer holds a reference to the page.

Yes, that's basically zero-copy reads.

It could be done. The kernel can remap the page to the physical page
holding the shared buffer and mark it read-only, then expire the
buffer and transfer ownership of the page if any page fault happens.
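
For reference, the closest mechanism available today is a private file
mapping. It is copy-on-write rather than the forget-on-write proposed
above: reads are served from the page-cache page, and the first write
faults and gives the process its own copy. A minimal sketch in C,
assuming Linux/POSIX; map_private and write_stays_private are made-up
names for illustration, not anything in postgres or the kernel:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map the first len bytes of fd privately.
 * Reads come straight from page-cache pages; the first write to a
 * page takes a fault and gives this process a private copy. */
char *map_private(int fd, size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    return p == MAP_FAILED ? NULL : (char *)p;
}

/* Hypothetical demo: dirty the mapping, then check that the file
 * itself was not modified.
 * Returns 1 if byte 0 of the file is unchanged, 0 if it changed,
 * -1 on error. The file must be at least one byte long. */
int write_stays_private(int fd, size_t len)
{
    char *page = map_private(fd, len);
    if (page == NULL)
        return -1;

    char original = page[0];          /* read: still the shared page */
    page[0] = (char)(original + 1);   /* write: copy-on-write fault */

    char on_disk;
    ssize_t n = pread(fd, &on_disk, 1, 0);
    munmap(page, len);
    if (n != 1)
        return -1;
    return on_disk == original;
}
```

The fault here still pays for a copy; the scheme described above would
transfer ownership of the page instead of copying it.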

But that incurs:
- Lots of page faults
- Hugely bloated mappings, unless KSM is somehow leveraged for this

And there's a nice bingo. Had forgotten about KSM. KSM could help lots.

I could try to see if madvising shared_buffers as mergeable helps. But
this should be an automatic case of KSM - i.e., when reading into a
page-aligned address, the kernel should summarily apply KSM-style
sharing without hinting. The current madvise interface puts the burden
of figuring out what duplicates what on the kernel, but postgres
already knows.
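
The madvise experiment would look roughly like the following. This is a
sketch only, assuming Linux built with CONFIG_KSM and with ksmd running
(/sys/kernel/mm/ksm/run set to 1); alloc_pool and mark_mergeable are
hypothetical helpers, not postgres functions. The kernel documentation
describes KSM as merging only private anonymous pages, so the sketch
uses a private anonymous mapping as a stand-in for the buffer pool; a
MAP_SHARED segment would presumably not be merged.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: allocate a page-aligned, private anonymous
 * region standing in for the buffer pool. Returns NULL on failure. */
void *alloc_pool(size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hint that pages in [pool, pool + len) may be merged by KSM.
 * Returns 0 on success, -1 with errno set on failure (for example
 * EINVAL when the kernel was built without KSM). This is only a hint:
 * ksmd still has to scan the range and find identical pages itself. */
int mark_mergeable(void *pool, size_t len)
{
#ifdef MADV_MERGEABLE
    return madvise(pool, len, MADV_MERGEABLE);
#else
    (void)pool;
    (void)len;
    errno = ENOSYS;   /* headers predate MADV_MERGEABLE */
    return -1;
#endif
}
```

A caller would do alloc_pool(len) once at startup and then
mark_mergeable(pool, len), treating failure as non-fatal. The scanning
cost stays with ksmd, which is the burden mentioned above.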
