Re: Why we are going to have to go DirectIO

From: Jonathan Corbet <corbet(at)lwn(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-04 20:31:39
Message-ID: 20131204133139.5dad25c9@lwn.net
Lists: pgsql-hackers

On Wed, 04 Dec 2013 11:07:04 -0800
Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> On 12/04/2013 07:33 AM, Jonathan Corbet wrote:
> > Wow, Josh, I'm surprised to hear this from you.
>
> Well, I figured it was too angry to propose for an LWN article. ;-)

So you're going to make us write it for you :)

> > The active/inactive list mechanism works great for the vast majority of
> > users. The second-use algorithm prevents a lot of pathological behavior,
> > like wiping out your entire cache by copying a big file or running a
> > backup. We *need* that kind of logic in the kernel.
>
> There's a large body of research on 2Q algorithms going back to the 80s,
> which is what this is. As far as I can tell, the modification was
> performed without any reading of this research, since that would have
> easily shown that 50/50 was unlikely to be a good division, and that in
> fact there is nothing which would work except a tunable setting, because
> workloads are different.

In general, the movement of useful information between academia and
real-world programming seems to be minimal at best. Neither side seems to
find much that is useful or interesting in what the other is doing.
Unfortunate.

For those interested in the details... (1) It's not quite 50/50; that's one
bound on how far the balance is allowed to go. (2) Anybody trying to add
tunables to the kernel tends to run into resistance. Exposing thousands of
knobs tends to lead to a situation where you *have* to be an expert on all
those knobs to get decent behavior out of your system. So there is a big
emphasis on having the kernel tune itself whenever possible. Here is a
situation where that is not always happening, but a fix (which introduces
no knob) is in the works.

As an example, I've never done much with the PostgreSQL knobs on the LWN
server. I just don't have the time to mess with it, and things Work Well
Enough.

</irrelevant_aside>
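
For anyone who has not looked at the mechanism being argued about, a toy
sketch of a two-list cache with second-use promotion may help. It is an
illustration only, not the kernel's code: the class name TwoListCache, the
max_active_fraction parameter, and the eviction details are all assumptions
made for the example. The point it tries to show is that a page lands on the
active list only on its second access, so a one-pass scan churns the inactive
list and leaves the hot pages alone.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy two-list page cache (hypothetical; not the Linux implementation)."""

    def __init__(self, capacity, max_active_fraction=0.5):
        # max_active_fraction is an assumed bound on the active list's share.
        self.capacity = capacity
        self.max_active = int(capacity * max_active_fraction)
        self.inactive = OrderedDict()  # pages seen once, oldest first
        self.active = OrderedDict()    # pages seen at least twice, oldest first

    def access(self, page):
        if page in self.active:
            # Already hot: just refresh its position.
            self.active.move_to_end(page)
            return
        if page in self.inactive:
            # Second use: promote from inactive to active.
            del self.inactive[page]
            self.active[page] = True
            # Keep the active list within its bound by demoting old pages.
            while len(self.active) > self.max_active:
                demoted, _ = self.active.popitem(last=False)
                self.inactive[demoted] = True
            return
        # First use: the page starts on the inactive list.
        self.inactive[page] = True
        # Evict the oldest inactive pages when the cache is over capacity.
        while len(self.inactive) + len(self.active) > self.capacity:
            self.inactive.popitem(last=False)


cache = TwoListCache(capacity=8)
for page in ["a", "b", "a", "b"]:  # hot pages, each touched twice
    cache.access(page)
for i in range(1000):              # one-pass scan, like a backup
    cache.access(f"scan-{i}")
print(sorted(cache.active))        # the hot pages are still cached
```

With a single plain LRU list of the same size, the scan would have pushed
both hot pages out. The cost of the two-list design is the question of how
to split the space between the lists, which is the balance discussed above.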

> However, this particular issue concerns me less than the general
> attitude that it's OK to push in experimental IO changes which can't be
> disabled by users into release kernels, as exemplified by several
> problematic and inadequately tested IO changes in the 3.X kernels --
> most notably the pdflush bug. It speaks of a policy that the Linux IO
> stack is not production software, and it's OK to tinker with it in ways
> that break things for many users.

Bugs and regressions happen, and I won't say that we do a good enough job
in that regard. There has been some concern recently that we're accepting
too much marginal stuff. We have problems getting enough people to
adequately review code — I think I've heard of another project or two with
similar issues :). But nobody sees the kernel as experimental or feels
that the introduction of bugs is an acceptable thing.

> I also wasn't exaggerating the reception I got when I tried to talk
> about IO and PostgreSQL at LinuxCon and other events. The majority of
> Linux hackers I've talked to simply don't want to be bothered with
> PostgreSQL's performance needs, and I've heard similar things from my
> colleagues at the MySQL variants. Greg KH was the only real exception.
>
> Heck, I went to a meeting of filesystem geeks at LinuxCon and the main
> feedback I received, from Linux FS developers (Chris and Ted), was
> "PostgreSQL should implement its own storage and use DirectIO, we don't
> know why you're even trying to use the Linux IO stack."

I think you're talking to the wrong people. Nothing you've described is a
filesystem problem; you're contending with memory management problems.
Chris and Ted weren't helpful because there's actually little they can do
to help you. I would be happy to introduce you to some people who would be
more likely to take your problems to heart.

Mel Gorman, for example, is working on putting together a set of MM
benchmarks in the hopes of quantifying changes and catching regressions
before new code is merged. He's one of the people who has to deal with
performance regressions when they show up in enterprise kernels, and I get
the sense he'd rather do less of that.

Perhaps even better: the next filesystem, storage, and memory management
summit is March 24-25. A session on your pain points there would bring in
a substantial portion of the relevant developers at all levels. LSFMM
is arguably the most productive kernel event I see over the course of a
year; it's where I would go first to make progress on this issue. I'm not
an LSFMM organizer, but I would be happy to work to make such a session
happen if somebody from the PostgreSQL community wanted to be there.

> > This code has been a bit slow getting into the mainline for a few reasons,
> > but one of the chief ones is this: nobody is saying from the sidelines
> > that they need it! If somebody were saying "Postgres would work a lot
> > better with this code in place" and had some numbers to demonstrate that,
> > we'd be far more likely to see it get into an upcoming release.
>
> Well, Citus did that; do you need more evidence?

Yes, they did that — one week ago. This patch has been in the works for
almost two years. And Citus has not taken anything to the kernel
community, so somebody else will have to do that for them. I might be able
to help in that regard.

> In addition to testing, though, I have yet to find a way to learn about
> new changes to IO or memory performance in the Linux Kernel without
> reading all of the traffic on LKML and all Linux commit messages and
> filtering them myself. If there were a better way to look for this
> information, Linux would be more likely to get feedback in a timely
> fashion. And yeah, I know that Postgres has the same issue.

Gee, if only there were a web site where one could read about changes to
the Linux kernel :)

Seriously, though, one of the best things to do would be to make a point of
picking up a kernel around -rc3 (right around now, say, for 3.13) and
running a few benchmarks on it. If you report a performance regression at
that stage, it will get attention.

Thanks,

jon
