Re: Linux kernel impact on PostgreSQL performance

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Mel Gorman <mgorman(at)suse(dot)de>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>
Subject: Re: Linux kernel impact on PostgreSQL performance
Date: 2014-01-17 00:30:59
Message-ID: CAMkU=1ywpD7e0N04_5A0davq1OjKTYmR-s_18KYeUnDVejgQBg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 15, 2014 at 2:08 AM, Mel Gorman <mgorman(at)suse(dot)de> wrote:

> On Tue, Jan 14, 2014 at 09:30:19AM -0800, Jeff Janes wrote:
> > >
> > > That could be something we look at. There are cases buried deep in the
> > > VM where pages get shuffled to the end of the LRU and get tagged for
> > > reclaim as soon as possible. Maybe you need access to something like
> > > that via posix_fadvise to say "reclaim this page if you need memory but
> > > leave it resident if there is no memory pressure" or something similar.
> > > Not exactly sure what that interface would look like or offhand how it
> > > could be reliably implemented.
> > >
> >
> > I think the "reclaim this page if you need memory but leave it resident
> > if there is no memory pressure" hint would be more useful for temporary
> > working files than for what was being discussed above (shared buffers).
> > When I do work that needs large temporary files, I often see physical
> > write IO spike but physical read IO does not. I interpret that to mean
> > that the temporary data is being written to disk to satisfy either
> > dirty_expire_centisecs or dirty_*bytes, but the data remains in the FS
> > cache and so disk reads are not needed to satisfy it. So a hint that
> > says "this file will never be fsynced, so please ignore dirty_*bytes
> > and dirty_expire_centisecs" would be useful.
>
> It would be good to know if dirty_expire_centisecs or dirty ratio|bytes
> were the problem here.

Is there an easy way to tell? I would guess it has to be at least
dirty_expire_centisecs, if not both, as a very large sort operation takes a
lot more than 30 seconds to complete.
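
(One rough way to watch this from userspace, for what it's worth, is to
sample Dirty and Writeback from /proc/meminfo while the sort runs: if Dirty
never climbs near the background threshold yet Writeback stays busy, the
expiry timer is the likelier trigger. A minimal sketch; the one-second
interval is an arbitrary choice:)

    /* Sketch: sample Dirty and Writeback from /proc/meminfo once a
     * second while the temp-file workload runs (Ctrl-C to stop). */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static long meminfo_kb(const char *key)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128];
        long val = -1;

        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, key, strlen(key)) == 0) {
                sscanf(line + strlen(key), " %ld", &val);
                break;
            }
        }
        fclose(f);
        return val;
    }

    int main(void)
    {
        for (;;) {
            printf("Dirty: %8ld kB   Writeback: %8ld kB\n",
                   meminfo_kb("Dirty:"), meminfo_kb("Writeback:"));
            fflush(stdout);
            sleep(1);
        }
    }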

> An interface that forces a dirty page to stay dirty
> regardless of the global system would be a major hazard. It potentially
> allows the creator of the temporary file to stall all other processes
> dirtying pages for an unbounded period of time.

Are the dirty ratio/bytes limits the mechanism by which adequate clean
memory is maintained? I thought those were there just to put a limit on how
long it would take to execute a sync call should one be issued, and that
there were other settings which said how much clean memory to maintain. The
kernel should definitely write out the pages if it needs the memory for
other things, just not write them out for fear of how long it would take to
sync them if a sync were called. (And if it does need the memory, it should
be able to write them out quickly, as the writes would be mostly sequential,
not random--although how the kernel can believe me that that will always be
the case could be a problem.)
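
(For reference, the knobs in question are all readable under /proc/sys/vm;
a quick sketch that just dumps them, assuming the usual convention that a
*_bytes value of 0 means the matching *_ratio is the one in effect:)

    /* Sketch: dump the writeback knobs.  dirty_background_{ratio,bytes}
     * is where the kernel starts background writeback; dirty_{ratio,bytes}
     * is where processes dirtying pages get throttled.  A *_bytes value
     * of 0 means the matching *_ratio is the one in effect. */
    #include <stdio.h>

    static long sysctl_val(const char *path)
    {
        FILE *f = fopen(path, "r");
        long val = -1;

        if (f) {
            if (fscanf(f, "%ld", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        static const char *knobs[] = {
            "/proc/sys/vm/dirty_background_ratio",
            "/proc/sys/vm/dirty_background_bytes",
            "/proc/sys/vm/dirty_ratio",
            "/proc/sys/vm/dirty_bytes",
            "/proc/sys/vm/dirty_expire_centisecs",
            "/proc/sys/vm/dirty_writeback_centisecs",
        };
        int i;

        for (i = 0; i < (int)(sizeof(knobs) / sizeof(knobs[0])); i++)
            printf("%-45s %ld\n", knobs[i], sysctl_val(knobs[i]));
        return 0;
    }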

> I proposed in another part
> of the thread a hint for open inodes to have the background writer thread
> ignore dirty pages belonging to that inode. Dirty limits and fsync would
> still be obeyed. It might also be workable for temporary files but the
> proposal could be full of holes.
>

If calling fsync on such a file simply failed with an error, would that
lower the risk of DoS?
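
(To make that concrete, here is a sketch of what such an interface might
look like from the application side. Everything here is hypothetical:
O_VOLATILE does not exist in any kernel, and its value below is only a
placeholder so the sketch compiles.)

    /* HYPOTHETICAL sketch -- no such flag exists in any kernel.  The
     * idea: a file whose pages are exempt from background writeback,
     * and for which fsync() fails outright, so the file cannot be
     * used to force an unbounded flush of dirty pages. */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define O_VOLATILE 040000000    /* placeholder, not a real flag */

    int main(void)
    {
        int fd = open("sort.tmp", O_CREAT | O_RDWR | O_VOLATILE, 0600);

        if (fd < 0)
            return 1;

        /* ... write temporary sort data; under memory pressure the
         * kernel may still clean or drop these pages as it sees fit ... */

        if (fsync(fd) < 0 && errno == EOPNOTSUPP) {
            /* expected under the imagined contract: "never durable" */
        }

        close(fd);
        return 0;
    }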

>
> Your alternative here is to create a private anonymous mapping as they
> are not subject to dirty limits. This is only a sensible option if the
> temporary data is guaranteed to be relatively small. If the shared
> buffers, page cache and your temporary data exceed the size of RAM then
> data will get discarded or your temporary data will get pushed to swap
> and performance will hit the floor.
>

PostgreSQL mainly uses temp files precisely when that guarantee is hard to
make. There is a pretty big range where the data is too big for us to be
certain it will fit in memory, so we have to switch to a disk-friendly
mostly-sequential algorithm. Yet it would still be nice to avoid the
actual disk writes until we have observed that the data really is growing
too big.
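
(For completeness, the anonymous-mapping route you describe is simple at
the syscall level; a minimal sketch, with the 64 MB size purely
illustrative:)

    /* Sketch of the alternative: keep the temp data in a private
     * anonymous mapping.  Anonymous pages are not subject to the
     * dirty_* limits; under memory pressure they go to swap rather
     * than through filesystem writeback. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL * 1024 * 1024;    /* illustrative size only */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* ... build sort runs in buf instead of writing a temp file ... */

        munmap(buf, len);
        return 0;
    }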

Cheers,

Jeff
