Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Jan Kara <jack(at)suse(dot)cz>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Jan Kara <jack(at)suse(dot)cz>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Dave Chinner <david(at)fromorbit(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Trond Myklebust <trondmy(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, James Bottomley <James(dot)Bottomley(at)HansenPartnership(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 18:37:04
Message-ID: 20140114183704.GA27863@quack.suse.cz
Lists: pgsql-hackers

On Tue 14-01-14 06:42:43, Kevin Grittner wrote:
> First off, I want to give a +1 on everything in the recent posts
> from Heikki and Hannu.
>
> Jan Kara <jack(at)suse(dot)cz> wrote:
>
> > Now the aging of pages marked as volatile as it is currently
> > implemented needn't be perfect for your needs but you still have
> > time to influence what gets implemented... Actually developers of
> > the vrange() syscall were specifically looking for some ideas
> > what to base aging on. Currently I think it is first marked -
> > first evicted.
>
> The "first marked - first evicted" seems like what we would want.
> The ability to "unmark" and have the page no longer be considered
> preferred for eviction would be very nice.  That seems to me like
> it would cover the multiple layers of buffering *clean* pages very
> nicely (although I know nothing more about vrange() than what has
> been said on this thread, so I could be missing something).
The email that introduces the syscall is here:
http://www.spinics.net/lists/linux-mm/msg67328.html
As you say, it might be a reasonable fit for your problem with double
caching of clean pages.

> The other side of that is related to avoiding multiple writes of the
> same page as much as possible, while avoiding write gluts.  The issue
> here is that PostgreSQL tries to hang on to dirty pages for as long
> as possible before "writing" them to the OS cache, while the OS
> tries to avoid writing them to storage for as long as possible
> until they reach a (configurable) threshold or are fsync'd.  The
> problem is that under various conditions PostgreSQL may need to
> write and fsync a lot of dirty pages it has accumulated in a short
> time.  That has an "avalanche" effect, creating a "write glut"
> which can stall all I/O for a period of many seconds up to a few
> minutes.  If the OS was aware of the dirty pages pending write in
> the application, and counted those for purposes of calculating when
> and how much to write, the glut could be avoided.  Currently,
> people configure the PostgreSQL background writer to be very
> aggressive, configure a small PostgreSQL shared_buffers setting,
> and/or set the OS thresholds low enough to minimize the problem;
> but all of these mitigation strategies have their own costs.
>
> A new hint that the application has dirtied a page could be used by
> the OS to improve things this way:  When the OS is notified that a
> page is dirty, it takes action depending on whether the page is
> considered dirty by the OS.  If it is not dirty, the page is
> immediately discarded from the OS cache.  It is known that the
> application has a modified version of the page that it intends to
> write, so the version in the OS cache has no value.  We don't want
> this page forcing eviction of vrange()-flagged pages.  If it is
> dirty, any write ordering to storage by the OS based on when the
> page was written to the OS would be pushed back as far as possible
> without crossing any write barriers, in hopes that the writes could
> be combined.  Either way, this page is counted toward dirty pages
> for purposes of calculating how much to write from the OS to
> storage, and the later write of the page doesn't redundantly add to
> this number.
The evict-if-clean part is easy. That could easily be a new fadvise()
option. Btw. note that POSIX_FADV_DONTNEED has quite a close meaning, except
that it also starts writeback on a dirty page if the backing device isn't
congested, which is somewhat contrary to what you want to achieve. But I'm
not sure the eviction would be a clear win, since the filesystem then has to
re-create the mapping from logical file block to disk block (it is cached
in the page) and that potentially needs to go to disk to fetch the mapping
data.
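
To make the POSIX_FADV_DONTNEED comparison concrete, here is a minimal
sketch (in Python, only for brevity) of how an application can already ask
the kernel to drop its cached copy of one block. The name evict_block and
the 8192-byte block size are assumptions made for illustration, not
PostgreSQL code. The call is only a hint, and on a dirty page it may start
writeback, so this is not the proposed evict-if-clean behaviour.

```python
import os

BLOCK_SIZE = 8192  # assumed block size, for illustration only


def evict_block(fd, block_no, block_size=BLOCK_SIZE):
    """Hint that the kernel may drop its cached copy of one file block.

    Uses the existing POSIX_FADV_DONTNEED advice. The kernel is free to
    ignore the hint, and a dirty page may be written back rather than
    simply dropped.
    """
    offset = block_no * block_size
    os.posix_fadvise(fd, offset, block_size, os.POSIX_FADV_DONTNEED)


def demo(path):
    """Write four blocks, make them clean, then advise on block 2."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"a" * (4 * BLOCK_SIZE))
        os.fsync(fd)  # pages are clean in the page cache after this
        evict_block(fd, 2)
        # The data is still readable; at worst it is fetched from disk.
        return os.pread(fd, BLOCK_SIZE, 2 * BLOCK_SIZE)
    finally:
        os.close(fd)
```

Calling evict_block(fd, 2) advises the kernel about bytes 16384..24575 of
the file. The file contents are unaffected whether or not the hint is
honoured.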

I have a hard time seeing how we would implement pushing back writeback
of a particular page (or better, a set of pages). When we need to write pages
because we are nearing the dirty_bytes limit, we likely want to write these
marked pages anyway to make as many pages freeable as possible. So the only
thing we could do is ignore these pages during periodic writeback, and
I'm not sure that would make a big difference.
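
The policy described above can be written down as a toy model. This is a
sketch of the reasoning only, not kernel code: the Page class, the
hold_back flag (standing in for the proposed hint), and counting the limit
in pages rather than bytes are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Page:
    index: int
    dirty: bool
    hold_back: bool = False  # hypothetical "app will rewrite this" hint


def select_for_writeback(pages, dirty_limit_pages):
    """Return the pages one writeback pass would write in this toy model.

    Near the dirty limit every dirty page is written, marked or not,
    so that as many pages as possible become freeable. Otherwise the
    pass is treated as periodic writeback and skips marked pages.
    """
    dirty = [p for p in pages if p.dirty]
    if len(dirty) >= dirty_limit_pages:
        return dirty  # limit pressure: the hint is ignored
    return [p for p in dirty if not p.hold_back]  # periodic pass
```

In this model the hint only changes anything while the number of dirty
pages stays below the limit, which is why the benefit looks small.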

Just to get some idea about the sizes - how large are the checkpoints we
are talking about that cause IO stalls?

Honza

--
Jan Kara <jack(at)suse(dot)cz>
SUSE Labs, CR
