Re: Two Necessary Kernel Tweaks for Linux Systems

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: sthomas(at)optionshouse(dot)com
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Vlad <marchenko(at)gmail(dot)com>
Subject: Re: Two Necessary Kernel Tweaks for Linux Systems
Date: 2013-01-07 19:22:13
Message-ID: CAHyXU0xpdpg0uBvrTO-r=X=Kf--QYL5sHV=JGeDy+TP+yGHD=w@mail.gmail.com
Lists: pgsql-performance

On Wed, Jan 2, 2013 at 3:46 PM, Shaun Thomas <sthomas(at)optionshouse(dot)com> wrote:
> Hey everyone!
>
> After much testing and hair-pulling, we've confirmed two kernel settings
> that should always be modified on production Linux systems, especially newer
> ones running the Completely Fair Scheduler (CFS) as opposed to the old O(1)
> scheduler.
>
> If you want to follow along, these are:
>
> /proc/sys/kernel/sched_migration_cost
> /proc/sys/kernel/sched_autogroup_enabled
>
> Which correspond to sysctl settings:
>
> kernel.sched_migration_cost
> kernel.sched_autogroup_enabled
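The two settings above can be changed at runtime with sysctl and persisted in /etc/sysctl.conf. A minimal sketch, using the values recommended later in this post (5ms migration cost, autogroup disabled); note that on newer kernels the first tunable was renamed `kernel.sched_migration_cost_ns`, so check which name your kernel exposes:

```shell
# Apply at runtime (requires root). Values are the ones recommended in
# this post: 5ms migration cost, autogroup disabled.
sysctl -w kernel.sched_migration_cost=5000000
sysctl -w kernel.sched_autogroup_enabled=0

# Persist across reboots by adding the same keys to /etc/sysctl.conf:
#   kernel.sched_migration_cost = 5000000
#   kernel.sched_autogroup_enabled = 0
# then reload with: sysctl -p
```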
>
> What do these settings do?
> --------------------------
>
> * sched_migration_cost
>
> The migration cost is the total time the scheduler will consider a migrated
> process "cache hot" and thus less likely to be re-migrated. By default this
> is 0.5ms (500000 ns); as the size of the process table increases, it
> eventually causes the scheduler to break down. On our systems, after a
> smooth degradation with increasing connection count, system CPU spiked from
> 20% to 70% sustained and TPS was cut by 5-10x once we crossed some invisible
> connection-count threshold. For us, that was a pgbench run with 900 or more
> clients.
>
> The migration cost should almost universally be increased on server systems
> with many processes. This means systems like PostgreSQL or Apache would
> benefit from having higher migration costs. We've had good luck with a
> setting of 5ms (5000000 ns) instead.
>
> When the breakdown occurs, system CPU (as obtained from sar) increases from
> 20% on a heavy pgbench (scale 3500 on a 72GB system) to over 70%, and
> %nice/%user is cut by half or more. A higher migration cost essentially
> eliminates this artificial throttle.
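The breakdown described above is easy to watch live with sar (from the sysstat package), which is where the system CPU figures quoted here come from. A quick sketch:

```shell
# Report CPU utilization every 5 seconds while ramping up pgbench clients.
# The breakdown shows up as the %system column jumping from ~20 to ~70
# while %user/%nice drop by half or more.
sar -u 5
```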
>
> * sched_autogroup_enabled
>
> This is a relatively new patch which Linus lauded back in late 2010. It
> basically groups tasks by TTY so perceived responsiveness is improved. But
> on server systems, large daemons like PostgreSQL are going to be launched
> from the same pseudo-TTY, and be effectively choked out of CPU cycles in
> favor of less important tasks.
>
> The default setting is 1 (enabled) on some platforms. By setting this to 0
> (disabled), we saw an outright 30% performance boost on the same pgbench
> test. A fully cached scale 3500 database on a 72GB system went from 67k TPS
> to 82k TPS with 900 client connections.
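Whether autogrouping is in effect on a given box can be checked directly from /proc; this is a sketch assuming a kernel built with CONFIG_SCHED_AUTOGROUP (the files are simply absent otherwise):

```shell
# 1 = autogrouping enabled, 0 = disabled.
cat /proc/sys/kernel/sched_autogroup_enabled

# Each process's autogroup is visible per-PID, e.g. "/autogroup-123 nice 0".
# All backends forked from one postmaster share a single autogroup, which is
# why they collectively get only one "session's" share of CPU.
cat /proc/self/autogroup
```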
>
> Total Benefit
> -------------
>
> At higher connection counts, such as on systems that can't use pooling or
> make extensive use of prepared queries, these settings can massively affect
> performance. At 900 connections, our test systems were at 17k TPS unaltered,
> but 85k TPS after these two modifications. Even with this performance boost,
> we still had 40% CPU free instead of 0%. In effect, the performance of the
> new scheduler is returned to normal under large process tables.
>
> Some systems will have a higher "cracking" point than others. The effect is
> amplified when a system is under high memory pressure, so running a lot of
> expensive queries across a high number of concurrent connections is the
> easiest way to replicate these results.
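The kind of run described in this post can be sketched with pgbench roughly as follows. The scale factor and client count are taken from the figures above; the thread count (-j) and duration are my own placeholder choices, and max_connections must be raised to allow 900+ clients:

```shell
# Initialize a scale-3500 pgbench database (large; ~50GB+ of data).
pgbench -i -s 3500 pgbench

# Run 900 clients for 5 minutes; -j 8 spreads clients over 8 pgbench
# worker threads (adjust to the client machine's core count).
pgbench -c 900 -j 8 -T 300 pgbench
```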
>
> Admins migrating from older systems (RHEL 5.x) may find this especially
> shocking, because the old O(1) scheduler was too "stupid" to have these
> advanced features, hence it was impossible to cause this kind of behavior.
>
> There's probably still a little room for improvement here, since 30-40% CPU
> is still unclaimed in our larger tests. I'd like to see the total
> performance drop (175k ideal TPS at 24 connections) decreased. But these
> kernel tweaks are rarely discussed anywhere, and there doesn't seem to be
> any consensus on how these (and other) scheduler settings should be
> modified under different usage scenarios.
>
> I just figured I'd share, since we found this info so beneficial.

This is fantastic info.

Vlad, you might want to check this out and see if it has any impact in
your high cpu case...via:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html

merlin
