Re: Amazon EC2 CPU Utilization

From: Rodger Donaldson <rodger(at)diaspora(dot)gen(dot)nz>
To: Mike Bresnahan <mike(dot)bresnahan(at)bestbuy(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Amazon EC2 CPU Utilization
Date: 2010-01-29 08:07:21
Message-ID: 4B629739.7080303@diaspora.gen.nz
Lists: pgsql-bugs pgsql-general

Mike Bresnahan wrote:
>
> I can understand that I will not get as much performance out of a EC2 instance
> as a dedicated server, but I don't understand why top(1) is showing 50% CPU
> utilization. If it were a memory speed problem wouldn't top(1) report 100% CPU
> utilization?

A couple of points:

top is not the be-all and end-all of analysis tools. I'm sure you know
that, but it bears repeating.

More importantly, in a virtualised environment the tools on the inside
of the guest don't have a full picture of what's really going on. I've
not done any real work with Xen; most of my experience is with zVM and
KVM.

It's pretty normal on a heavily loaded server to see tools like top (and
vmstat, sar, et al) inside the guest reporting less than 100% use while
the physical box is running flat-out, with nothing left for the guest to
get. I had this
last night doing a load on a guest - 60-70% CPU at peak, with no more
available. You *should* see steal and 0% idle time in this case, but I
*have* seen zVM Linux guests reporting ample idle time while the zVM
level monitoring tools reported the LPAR as a whole running at 90-95%
utilisation (which is when an LPAR will usually run out of steam).
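
As a rough sketch of how you might check for steal from inside a Linux
guest rather than relying on top's summary line: the aggregate "cpu" line
in /proc/stat carries a steal counter as its eighth value on reasonably
recent kernels. The function names below are made up for illustration,
and the caveat above still applies, since a hypervisor that doesn't report
steal to the guest will just show zero here.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels omit trailing counters such as steal; treat them as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Return each counter's share (in percent) of the ticks between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample_shares(interval=5.0, path="/proc/stat"):
    """Take two samples 'interval' seconds apart and return the shares between them."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return cpu_shares(first, second)
```

Calling sample_shares() while the load is running and looking at the
"steal" and "idle" entries gives the guest's own view; a high steal figure
with near-zero idle is the "you *should* see" case, and anything else
needs checking against the hypervisor's numbers.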

A secondary effect is that sometimes the scheduling of guests on and off
the hypervisor will cause skewing in the timekeeping of the guest; it's
not uncommon in our loaded-up zVM environment to see discrepancies of
5-20% between the guest's view of how much CPU time it thinks it's
getting and how much time the hypervisor knows it's getting (this is why
companies like Velocity make money selling hypervisor-aware tools that
auto-correct those stats).

> In any case, assuming this is a EC2 memory speed thing, it is going to be
> difficult to diagnose application bottlenecks when I cannot rely on top(1)
> reporting meaningful CPU stats.

It's going to be even harder from inside the guests, since you're
getting an incomplete view of the system as a whole.

You could try c2cbench (http://sourceforge.net/projects/c2cbench/),
which is designed to benchmark memory cache performance, but it'll still
be subject to the caveats I outlined above: it may give you something
indicative if you think it's a cache problem, but it may also simply
tell you that the virtual CPUs are fine while the real processors are
pegged for cache from running a bunch of workloads with high memory
pressure.

If you were running a newer kernel you could look at perf_counters or
something similar to get more detail from what the guest thinks it's
doing, but, again, there are going to be inaccuracies.
