Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Jim Nasby <jim(at)nasby(dot)net>
To: Dave Chinner <david(at)fromorbit(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mel Gorman <mgorman(at)suse(dot)de>, Josh Berkus <josh(at)agliodbs(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-15 03:54:20
Message-ID: 52D6066C.9020100@nasby.net
Lists: pgsql-hackers

On 1/14/14, 3:41 PM, Dave Chinner wrote:
> On Tue, Jan 14, 2014 at 09:40:48AM -0500, Robert Haas wrote:
>> On Mon, Jan 13, 2014 at 5:26 PM, Mel Gorman <mgorman(at)suse(dot)de> wrote:
> IOWs, using sync_file_range() does not avoid the need to fsync() a
> file for data integrity purposes...

I believe the PG community understands that, but thanks for the heads-up.

>> Whether the problem is with the system
>> call or the programmer is harder to determine. I think the problem is
>> in part that it's not exactly clear when we should call it. So
>> suppose we want to do a checkpoint. What we used to do a long time
>> ago is write everything, and then fsync it all, and then call it good.
>> But that produced horrible I/O storms. So what we do now is do the
>> writes over a period of time, with sleeps in between, and then fsync
>> it all at the end, hoping that the kernel will write some of it before
>> the fsyncs arrive so that we don't get a huge I/O spike.
>> And that sorta works, and it's definitely better than doing it all at
>> full speed, but it's pretty imprecise. If the kernel doesn't write
>> enough of the data out in advance, then there's still a huge I/O storm
>> when we do the fsyncs and everything grinds to a halt. If it writes
>> out more data than needed in advance, it increases the total number of
>> physical writes because we get less write-combining, and that hurts
>> performance, too.
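
The strategy Robert describes can be sketched roughly as follows. This is a minimal illustration and not PostgreSQL's actual checkpointer: the helper name `spread_checkpoint`, the 8 kB page size, and the sleep interval are all assumptions made for the example.

```python
import os
import tempfile
import time


def spread_checkpoint(path, pages, pause_s=0.01):
    """Write pages with sleeps in between, then fsync once at the end.

    Hypothetical helper: between write() and fsync() the kernel is free
    to flush dirty data whenever its own writeback policy decides to.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = 0
        for page in pages:
            written += os.write(fd, page)  # lands in the page cache only
            time.sleep(pause_s)            # spread the writes over time
        os.fsync(fd)                       # durability point; may still stall
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    pages = [bytes([i]) * 8192 for i in range(8)]  # eight dummy 8 kB pages
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "relation.dat")
        print(spread_checkpoint(target, pages))
```

Nothing in this loop tells the kernel how much of the data should already be on disk when the final `fsync()` arrives, which is the imprecision being described.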

I think there's a pretty important bit that Robert didn't mention: we have a specific *time* target for when we want all the fsyncs to complete. People who have problems here tend to tune checkpoints to complete every 5-15 minutes, and they want the write traffic for the checkpoint spread out over 90% of that interval. To put it another way, the fsyncs should be finished once 90% of the time to the next checkpoint has elapsed, but preferably not much before then.
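
As a worked example of that time target, here is a small sketch of the arithmetic. The function names are made up for illustration; the 0.9 default mirrors the 90% figure above (the corresponding PostgreSQL setting is `checkpoint_completion_target`), and the even pacing is an assumption rather than a description of the real checkpointer.

```python
def fsync_deadline_s(checkpoint_interval_s, completion_target=0.9):
    """Seconds after checkpoint start by which the fsyncs should be done."""
    return checkpoint_interval_s * completion_target


def pages_due(total_pages, elapsed_s, checkpoint_interval_s,
              completion_target=0.9):
    """Pages that should have been written by now for an even spread."""
    deadline = fsync_deadline_s(checkpoint_interval_s, completion_target)
    fraction = min(elapsed_s / deadline, 1.0)  # never ask for more than 100%
    return int(total_pages * fraction)


if __name__ == "__main__":
    for minutes in (5, 15):
        print(minutes, "min interval ->", fsync_deadline_s(minutes * 60), "s")
    # Halfway to the deadline of a 5-minute checkpoint with 10000 dirty pages.
    print(pages_due(10000, 135, 300))
```

With a 5-minute interval the deadline falls 270 seconds after the checkpoint starts; with a 15-minute interval it falls at 810 seconds.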

> Yup, the kernel defaults to maximising bulk write throughput, which
> means it waits to the last possible moment to issue write IO. And
> that's exactly to maximise write combining, optimise delayed
> allocation, etc. There are many good reasons for doing this, and for
> the majority of workloads it is the right behaviour to have.
>
> It sounds to me like you want the kernel to start background
> writeback earlier so that it doesn't build up as much dirty data
> before you require a flush. There are several ways to do this by
> tweaking writeback knobs. The simplest is probably just to set
> /proc/sys/vm/dirty_background_bytes to an appropriate threshold (say
> 50MB) and dirty_expire_centiseconds to a few seconds so that
> background writeback starts and walks all dirty inodes almost
> immediately. This will keep a steady stream of low level background
> IO going, and fsync should then not take very long.
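
For reference, a small sketch of what those two knobs would look like with the values suggested above. Assumptions: "50MB" is read as 50 MiB, "a few seconds" is taken to be 3, and the expiry knob is spelled `dirty_expire_centisecs`, which is the name the file has under /proc/sys/vm. The helper name is hypothetical, and the script only prints the commands instead of applying them.

```python
MIB = 1024 * 1024


def writeback_settings(background_mb=50, expire_s=3):
    """Map the suggested thresholds to /proc/sys/vm file names and values."""
    return {
        "/proc/sys/vm/dirty_background_bytes": background_mb * MIB,
        # Unit is hundredths of a second, so 3 s -> 300.
        "/proc/sys/vm/dirty_expire_centisecs": expire_s * 100,
    }


if __name__ == "__main__":
    for path, value in writeback_settings().items():
        # Print rather than write: changing these needs root and affects
        # the whole machine, not just the database.
        print(f"echo {value} > {path}")
```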

Except that still won't throttle writes, right? That's the big issue here: our users often can't tolerate big spikes in IO latency. They want user requests to always happen within a specific amount of time.

So while delaying writes potentially reduces the total amount of data you're writing, users who run into problems here ultimately care more about ensuring that their foreground IO completes in a timely fashion.

> Fundamentally, though, we need bug reports from people seeing these
> problems when they see them so we can diagnose them on their
> systems. Trying to discuss/diagnose these problems without knowing
> anything about the storage, the kernel version, writeback
> thresholds, etc really doesn't work because we can't easily
> determine a root cause.
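
A rough sketch of how some of that information could be gathered for a report. It only covers the kernel version and the dirty-page writeback thresholds; storage details would still have to be added by hand. The function names and the choice of the `dirty_*` files are assumptions for illustration.

```python
import glob
import os
import platform


def read_knob(path):
    """Return the stripped contents of a sysctl file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def collect_report(vm_dir="/proc/sys/vm"):
    """Collect the kernel release plus every dirty_* writeback threshold."""
    report = {"kernel": platform.release()}
    for path in sorted(glob.glob(os.path.join(vm_dir, "dirty_*"))):
        report[os.path.basename(path)] = read_knob(path)
    return report


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

On a system without /proc/sys/vm the report simply contains the kernel release and nothing else.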

So is lsf-pc(at)linux-foundation(dot)org the best way to accomplish that?

Also, along the lines of collaboration, it would be awesome to see kernel hackers at PGCon (http://pgcon.org) for further discussion of this stuff. That is the conference that has more Postgres internal developers than any other. There are a variety of ways collaboration could happen there, so it's probably best to start a separate discussion with those from the Linux community who'd be interested in attending.

PGCon also directly follows BSDCan (http://bsdcan.org) at the same venue... so we could potentially kill two OS birds with one stone, so to speak... :) If there's enough interest we could potentially do a "mini Postgres/OS conference" in between BSDCan and the formal PGCon. There's also potential for the Postgres community to sponsor attendance for kernel hackers if money is a factor.

Like I said... best to start a separate thread if there's significant interest in meeting at PGCon. :)
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net
