Re: Linux kernel impact on PostgreSQL performance

From: Mel Gorman <mgorman(at)suse(dot)de>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>
Subject: Re: Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 10:21:43
Message-ID: 20140114102143.GA4963@suse.de
Lists: pgsql-hackers

On Mon, Jan 13, 2014 at 03:24:38PM -0800, Josh Berkus wrote:
> On 01/13/2014 02:26 PM, Mel Gorman wrote:
> > Really?
> >
> > zone_reclaim_mode is often a complete disaster unless the workload is
> > partitioned to fit within NUMA nodes. On older kernels enabling it would
> > sometimes cause massive stalls. I'm actually very surprised to hear it
> > fixes anything and would be interested in hearing more about what sort
> > of circumstances would convince you to enable that thing.
>
> So the problem with the default setting is that it pretty much isolates
> all FS cache for PostgreSQL to whichever socket the postmaster is
> running on, and makes the other FS cache unavailable.

I'm not being pedantic, but the default depends on the NUMA characteristics of
the machine so I need to know if it was enabled or disabled. Some machines
will default zone_reclaim_mode to 0 and others will default it to 1. In my
experience the majority of bugs that involved zone_reclaim_mode were due
to zone_reclaim_mode enabled by default. If I see a bug that involves
a file-based workload on a NUMA machine with stalls and/or excessive IO
when there is plenty of memory free then zone_reclaim_mode is the first
thing I check.
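
For readers who want to run the same check, here is a minimal sketch. It
assumes a Linux system that exposes the sysctl at the usual procfs path;
the helper names read_zone_reclaim_mode and describe are made up for this
illustration and are not part of any tool.

```python
from pathlib import Path

# Usual procfs location of vm.zone_reclaim_mode on Linux (assumed to exist
# only on kernels built with NUMA support).
ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Return the integer value of the sysctl, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe(mode):
    """Turn the raw value into a one-line, human-readable summary."""
    if mode is None:
        return "zone_reclaim_mode unavailable (not Linux, or no NUMA support?)"
    if mode == 0:
        return "zone_reclaim_mode=0: disabled, allocations can use other nodes"
    return f"zone_reclaim_mode={mode}: enabled, the kernel reclaims locally first"


if __name__ == "__main__":
    print(describe(read_zone_reclaim_mode()))
```

Any non-zero value means zone reclaim is enabled in some form; the sketch
does not decode the individual bits.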

I'm guessing from context that in your experience it gets enabled by default
on the machines you care about. This would indeed limit FS cache usage to
the node where the process is initiating IO (postmaster I guess).
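
One rough way to see whether the page cache really is concentrated on a
single node is to compare the per-node FilePages counters. The sketch below
assumes the per-node meminfo files under /sys/devices/system/node and a line
format of "Node N FilePages: X kB"; both the format and the helper names are
assumptions made for this illustration.

```python
import re
from pathlib import Path

# Assumed line format in /sys/devices/system/node/nodeN/meminfo:
#   "Node 0 FilePages:      1234567 kB"
_FILEPAGES = re.compile(r"^Node\s+(\d+)\s+FilePages:\s+(\d+)\s+kB", re.MULTILINE)


def file_pages_kb(meminfo_text):
    """Map node id -> page cache size in kB found in the given meminfo text."""
    return {int(node): int(kb) for node, kb in _FILEPAGES.findall(meminfo_text)}


def cache_share(per_node_kb):
    """Return each node's fraction of the total page cache ({} if total is 0)."""
    total = sum(per_node_kb.values())
    if total == 0:
        return {}
    return {node: kb / total for node, kb in per_node_kb.items()}


def read_all_nodes(base="/sys/devices/system/node"):
    """Concatenate every per-node meminfo file; returns '' if none exist."""
    root = Path(base)
    if not root.is_dir():
        return ""
    return "".join(p.read_text() for p in sorted(root.glob("node[0-9]*/meminfo")))


if __name__ == "__main__":
    for node, share in sorted(cache_share(file_pages_kb(read_all_nodes())).items()):
        print(f"node {node}: {share:.1%} of page cache")
```

A heavily lopsided split on a machine with plenty of free memory on the
other node would be consistent with the behaviour described above, though it
is not proof of it on its own.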

> This means that,
> for example, if you have two memory banks, then only one of them is
> available for PostgreSQL filesystem caching ... essentially cutting your
> available cache in half.
>
> And however slow moving cached pages between memory banks is, it's an
> order of magnitude faster than moving them from disk. But this isn't
> how the NUMA stuff is configured; it seems to assume that it's less
> expensive to get pages from disk than to move them between banks, so

Yes, this is right. The history behind this "logic" is that it was assumed
NUMA machines would only ever be used for HPC and that the workloads would
always be partitioned to run within NUMA nodes. This has not been the case
for a long time and I would argue that we should leave that thing disabled
by default in all cases. Last time I tried, it was met with resistance but
maybe it's time to try again.

> whatever you've got cached on the other bank, it flushes it to disk as
> fast as possible. I understand the goal was to make memory usage local
> to the processors stuff was running on, but that includes an implicit
> assumption that no individual process will ever want more than one
> memory bank worth of cache.
>
> So disabling all of the NUMA optimizations is the way to go for any
> workload I personally deal with.
>

I would hesitate to recommend "all" on the grounds that zone_reclaim_mode
is brain damage and I'd hate to lump all tuning parameters into the same box.

There is an interesting side-line here. If all IO is initiated by one
process in postgres then the memory locality will be sub-optimal.
The consumer of the data may or may not be running on the same
node as the process that read the data from disk. It is possible to
migrate this from user space but the interface is clumsy and assumes the
data is mapped.

Automatic NUMA balancing does not help you here because that thing also
depends on the data being mapped. It does nothing for data accessed via
read/write. There is nothing fundamental that prevents this; it was not
implemented because it was not deemed to be important enough. The amount
of effort spent on addressing this would depend on how important NUMA
locality is for postgres performance.

--
Mel Gorman
SUSE Labs
