Re: Optimizing Database High CPU

From: Scottix <scottix(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Michael Lewis <mlewis(at)entrata(dot)com>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Optimizing Database High CPU
Date: 2019-05-10 20:26:08
Message-ID: CANKFHZ-KVt08zRkbZnLJHprMnLLgVdLejHybECvLWCDormO1Xg@mail.gmail.com
Lists: pgsql-general

Hey,
So I finally found the culprit. It turns out to be THP (transparent huge
pages) fighting with itself.

After running the following on Ubuntu:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

it instantly went from a loadavg of 30 to 3.

Also make sure you re-apply these settings on reboot, since they don't persist.
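
If it helps anyone, one way to make the setting persist is a small systemd
unit (a sketch only; the unit name is my own, adjust as needed):

# /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
After=sysinit.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=multi-user.target

# enable it so it runs at every boot:
systemctl daemon-reload && systemctl enable --now disable-thp.service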

Anyway, I just wanted to give a follow-up on the issue in case anyone else
is having the same problem.

On Mon, Mar 4, 2019 at 12:03 PM Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Wed, Feb 27, 2019 at 5:01 PM Michael Lewis <mlewis(at)entrata(dot)com> wrote:
>
>> If those 50-100 connections are all active at once, yes, that is high.
>>> They can easily spend more time fighting each other over LWLocks,
>>> spinlocks, or cachelines rather than doing useful work. This can be
>>> exacerbated when you have multiple sockets rather than all cores in a
>>> single socket. And these problems are likely to present as high Sys times.
>>>
>>> Perhaps you can put up a connection pooler which will allow 100
>>> connections to all think they are connected at once, but forces only 12 or
>>> so to actually be active at one time, making the others transparently queue.
>>>
>>
>> Can you expound on this or refer me to someplace to read up on this?
>>
>
> Just based on my own experimentation. This is not a blanket
> recommendation, but specific to the situation where we already suspect
> there is contention, and the server is too old to have the
> pg_stat_activity.wait_event column.
>
>
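
For concreteness, a pooler setup along the lines Jeff describes might look
like this in pgbouncer.ini (illustrative names and sizes, assuming PgBouncer
in transaction mode):

; pgbouncer.ini -- let 100 clients connect, but only ~12 run at once
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
pool_mode = transaction
max_client_conn = 100
default_pool_size = 12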
>> For context (I don't want to thread-jack): I think I am seeing similar
>> behavior in our environment at times, with queries that normally take
>> seconds taking 5+ minutes at times of high load. I see many queries showing
>> buffer_mapping as the LWLock type in snapshots but don't know if that may
>> be expected.
>>
>
> It sounds like your processes are fighting to reserve buffers in
> shared_buffers in which to read data pages. But those data pages are
> probably already in the OS page cache; otherwise reading them from disk
> would be slow enough that you would be seeing some type of IO wait, or buffer_io,
> rather than buffer_mapping as the dominant wait type. So I think that
> means you have most of your data in RAM, but not enough of it in
> shared_buffers. You might be in a rare situation where setting
> shared_buffers to a high fraction of RAM, rather than the usual low
> fraction, is called for. Increasing NUM_BUFFER_PARTITIONS might also be
> useful, but that requires a recompilation of the server. But do these
> spikes correlate with anything known at the application level? A change in
> the mix of queries, or a long report or maintenance operation? Maybe the
> query plans briefly toggle over to using seq scans rather than index scans
> or vice versa, which drastically changes the block access patterns?
>
>
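
On 9.6 or later, a quick way to snapshot the dominant wait events during a
spike (a sketch; run it a few times and compare):

-- count active backends by wait event
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;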
>> In our environment PgBouncer will accept several hundred connections and
>> allow up to 100 at a time to be active on the database which are VMs with
>> ~16 CPUs allocated (some more, some less, multi-tenant and manually
>> sharded). It sounds like you are advocating for connection max very close
>> to the number of cores. I'd like to better understand the pros/cons of that
>> decision.
>>
>
> There are good reasons to allow more than that. For example, your
> application holds some transactions open briefly while it does some
> cogitation on the application-side, rather than immediately committing and
> so returning the connection to the connection pool. Or your server has a
> very high IO capacity and benefits from lots of read requests in the queue
> at the same time, so it can keep every spindle busy and every rotation
> productive. But if you have no reason to believe that any of those
> situations apply to you, and you do have evidence of lock contention
> between processes, then I think that limiting the number of active
> processes to the number of cores is a good starting point.
>
> Cheers,
>
> Jeff
>
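
For anyone following along: raising shared_buffers as Jeff suggests can be
done with ALTER SYSTEM (the size here is illustrative only, and the restart
command assumes a stock Ubuntu service name):

# shared_buffers only takes effect after a restart
psql -c "ALTER SYSTEM SET shared_buffers = '24GB';"
sudo systemctl restart postgresql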

--
T: @Thaumion
IG: Thaumion
Scottix(at)Gmail(dot)com
