Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Dave Chinner <david(at)fromorbit(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-17 03:37:04
Message-ID: CAMkU=1zSw-FYw0KKOZM=E1iTZRrt8QN778LehM8ZWVWJ0TTdeQ@mail.gmail.com
Lists: pgsql-hackers

On Thursday, January 16, 2014, Dave Chinner <david(at)fromorbit(dot)com> wrote:

> On Thu, Jan 16, 2014 at 03:58:56PM -0800, Jeff Janes wrote:
> > On Thu, Jan 16, 2014 at 3:23 PM, Dave Chinner <david(at)fromorbit(dot)com>
> wrote:
> >
> > > On Wed, Jan 15, 2014 at 06:14:18PM -0600, Jim Nasby wrote:
> > > > On 1/15/14, 12:00 AM, Claudio Freire wrote:
> > > > >My completely unproven theory is that swapping is overwhelmed by
> > > > >near-misses. Ie: a process touches a page, and before it's
> > > > >actually swapped in, another process touches it too, blocking on
> > > > >the other process' read. But the second process doesn't account
> > > > >for that page when evaluating predictive models (ie: read-ahead),
> > > > >so the next I/O by process 2 is unexpected to the kernel. Then
> > > > >the same with 1. Etc... In essence, swap, by a fluke of its
> > > > >implementation, fails utterly to predict the I/O pattern, and
> > > > >results in far sub-optimal reads.
> > > > >
> > > > >Explicit I/O is free from that effect, all read calls are
> > > > >accountable, and that makes a difference.
> > > > >
> > > > >Maybe, if the kernel could be fixed in that respect, you could
> > > > >consider mmap'd files as a suitable form of temporary storage.
> > > > >But that would depend on the success and availability of such a
> > > > >fix/patch.
> > > >
> > > > Another option is to consider some of the more "radical" ideas in
> > > > this thread, but only for temporary data. Our write sequencing and
> > > > other needs are far less stringent for this stuff. -- Jim C.
> > >
> > > I suspect that a lot of the temporary data issues can be solved by
> > > using tmpfs for temporary files....
> > >
> >
> > Temp files can collectively reach hundreds of gigs.
>
> So unless you have terabytes of RAM you're going to have to write
> them back to disk.
>

If they turn out to be hundreds of gigs, then yes they have to hit disk (at
least on my hardware). But if they are 10 gig, then maybe not (depending
on whether other people decide to do similar things at the same time I'm
going to be doing it--something which is often hard to predict). But now
for every action I take, I have to decide, is this going to take 10 gig, or
14 gig, and how absolutely certain am I? And is someone else going to try
something similar at the same time? What a hassle. It would be so much
nicer to say "This is accessed sequentially, and will never be fsynced.
Maybe it will fit entirely in memory, maybe it won't, either way, you know
what to do."

If I start out writing to tmpfs, I can't very easily change my mind 94% of
the way through and decide to go somewhere else. But the kernel,
effectively, can.
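
For what it's worth, the closest existing knob I know of is posix_fadvise().
Below is a minimal sketch of how a temp-file writer could use it today; the
file name, sizes, and loop are invented for illustration, and since the hints
are purely advisory (and there is no way to say "this will never be fsynced"),
it only approximates what I'm asking for:

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Invented temp file name, purely for illustration. */
    int fd = open("/tmp/pgsql_tmp_example", O_CREAT | O_RDWR | O_TRUNC, 0600);
    char buf[8192];

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Declare the sequential access pattern up front. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < 1024; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) {
            perror("write");
            return 1;
        }
    }

    /* Hint that the written range will not be needed again.  For pages that
     * are still dirty the kernel typically starts writeback rather than
     * dropping them, which is part of why this is only an approximation. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    unlink("/tmp/pgsql_tmp_example");
    return 0;
}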

> But there's something here that I'm not getting - you're talking
> about a data set that you want to keep cache resident that is at
> least an order of magnitude larger than the cyclic 5-15 minute WAL
> dataset that ongoing operations need to manage to avoid IO storms.
>

Those are mostly orthogonal issues. The permanent files need to be fsynced
on a regular basis, and might have gigabytes of data dirtied at random from
within terabytes of underlying storage. We'd better start writing that back
pretty quickly, or when we do issue the fsyncs the world will fall apart.

The temporary files will never need to be fsynced, and can be written out
sequentially if they do ever need to be written out. Better to delay this
as much as feasible.
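
To be concrete about "start writing that back pretty quickly": on Linux that
kind of paced writeback can be approximated from userspace with
sync_file_range(), as in the sketch below. The file name, block size, and the
8 MB flush threshold are arbitrary illustrative values, and this is the shape
of the idea rather than anything PostgreSQL does today:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE        8192
#define FLUSH_AFTER_BYTES (8 * 1024 * 1024)

int main(void)
{
    int   fd = open("datafile.example", O_CREAT | O_RDWR, 0600);
    char  buf[BLOCK_SIZE];
    off_t dirtied_since_flush = 0;

    if (fd < 0)
        return 1;
    memset(buf, 'x', sizeof(buf));

    for (int i = 0; i < 4096; i++) {
        /* Dirty a block at a pseudo-random offset, as a checkpoint-style
         * writer scattering updates across the file might. */
        off_t offset = (off_t) (rand() % 4096) * BLOCK_SIZE;

        if (pwrite(fd, buf, sizeof(buf), offset) != (ssize_t) sizeof(buf))
            return 1;
        dirtied_since_flush += sizeof(buf);

        /* Periodically start (but do not wait for) writeback of whatever is
         * dirty so far; nbytes == 0 means "to the end of the file". */
        if (dirtied_since_flush >= FLUSH_AFTER_BYTES) {
            sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
            dirtied_since_flush = 0;
        }
    }

    fsync(fd);          /* the final fsync now has much less left to do */
    close(fd);
    return 0;
}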

> Where do these temporary files fit into this picture, how fast do
> they grow, and why do they need to be so large in comparison to
> the ongoing modifications being made to the database?
>

The permanent files tend to be things like "Jane Doe just bought a pair of
green shoes from Hendrick Green Shoes Limited--record that, charge her
credit card, and schedule delivery". The temp files are more like "It is
the end of the year, how many shoes have been purchased in each color from
each manufacturer for each quarter over the last 6 years?" So the temp
files quickly manipulate data that has slowly accumulated over a very long
time, while the permanent files record the process of that accumulation.

If you are Amazon, of course, you have thousands of people who can keep two
sets of records, one organized for fast update and one slightly delayed
copy reorganized for fast analysis, and also do partial analysis on an
ongoing basis and roll them up in ways that can be incrementally updated.
If you are not Amazon, it would be nice if one system did a better job of
doing both, with the trade-off between the two being dynamic and automatic.

Cheers,

Jeff
