Re: Wierd context-switching issue on Xeon

From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: josh(at)agliodbs(dot)com, "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, lutzeb(at)aeccom(dot)com, pgsql-performance(at)postgresql(dot)org, Neil Conway <neilc(at)samurai(dot)com>
Subject: Re: Wierd context-switching issue on Xeon
Date: 2004-04-20 03:00:05
Message-ID: 40849235.2070808@joeconway.com
Lists: pgsql-performance

Tom Lane wrote:
> Here is a test case. To set up, run the "test_setup.sql" script once;
> then launch two copies of the "test_run.sql" script. (For those of
> you with more than two CPUs, see whether you need one per CPU to make
> trouble, or whether two test_runs are enough.) Check that you get a
> nestloops-with-index-scans plan shown by the EXPLAIN in test_run.

Check.

> In isolation, test_run.sql should do essentially no syscalls at all once
> it's past the initial ramp-up. On a machine that's functioning per
> expectations, multiple copies of test_run show a relatively low rate of
> semop() calls --- a few per second, at most --- and maybe a delaying
> select() here and there.
>
> What I actually see on Josh's client's machine is a context swap storm:
> "vmstat 1" shows CS rates around 170K/sec. strace'ing the backends
> shows a corresponding rate of semop() syscalls, with a few delaying
> select()s sprinkled in. top(1) shows system CPU percent of 25-30
> and idle CPU percent of 16-20.

Your test case works perfectly. I ran 4 concurrent psql sessions on a
hyperthreaded quad Xeon (IBM x445, 2.8GHz, 4GB RAM). Here's what 'top'
looks like:

177 processes: 173 sleeping, 3 running, 1 zombie, 0 stopped
CPU states:   cpu    user    nice  system     irq softirq  iowait    idle
            total   35.9%    0.0%    7.2%    0.0%    0.0%    0.0%   56.8%
            cpu00   19.6%    0.0%    4.9%    0.0%    0.0%    0.0%   75.4%
            cpu01   44.1%    0.0%    7.8%    0.0%    0.0%    0.0%   48.0%
            cpu02    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%  100.0%
            cpu03   32.3%    0.0%   13.7%    0.0%    0.0%    0.0%   53.9%
            cpu04   21.5%    0.0%   10.7%    0.0%    0.0%    0.0%   67.6%
            cpu05   42.1%    0.0%    9.8%    0.0%    0.0%    0.0%   48.0%
            cpu06  100.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%
            cpu07   27.4%    0.0%   10.7%    0.0%    0.0%    0.0%   61.7%
Mem: 4123700k av, 3933896k used, 189804k free, 0k shrd, 221948k buff
2492124k actv, 760612k in_d, 41416k in_c
Swap: 2040244k av, 5632k used, 2034612k free 3113272k cached

Note that the load on cpu06 is not from a postgres process. The output
of vmstat looks like this:

# vmstat 1
procs               memory          swap          io       system        cpu
 r  b   swpd   free   buff   cache   si   so    bi    bo   in      cs us sy id wa
 4  0   5632 184264 221948 3113308    0    0     0     0    0       0  0  0  0  0
 3  0   5632 184264 221948 3113308    0    0     0     0  112  211894 36  9 55  0
 5  0   5632 184264 221948 3113308    0    0     0     0  125  222071 39  8 53  0
 4  0   5632 184264 221948 3113308    0    0     0     0  110  215097 39 10 52  0
 1  0   5632 184588 221948 3113308    0    0     0    96  139  187561 35 10 55  0
 3  0   5632 184588 221948 3113308    0    0     0     0  114  241731 38 10 52  0
 3  0   5632 184920 221948 3113308    0    0     0     0  132  257168 40  9 51  0
 1  0   5632 184912 221948 3113308    0    0     0     0  114  251802 38  9 54  0
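
For anyone who wants a single number to compare across machines, here is a minimal sketch that averages the "cs" column of captured "vmstat 1" output. The function name mean_context_switches and the SAMPLE constant are made up for illustration and are not part of the test case; the data is just the vmstat output above. It assumes the column header row starts with "r" and contains "cs", and it drops the first data row because that row is not a one-second sample.

```python
# Sketch only: helper name and SAMPLE are illustrative assumptions.
SAMPLE = """\
r b swpd free buff cache si so bi bo in cs us sy id wa
4 0 5632 184264 221948 3113308 0 0 0 0 0 0 0 0 0 0
3 0 5632 184264 221948 3113308 0 0 0 0 112 211894 36 9 55 0
5 0 5632 184264 221948 3113308 0 0 0 0 125 222071 39 8 53 0
4 0 5632 184264 221948 3113308 0 0 0 0 110 215097 39 10 52 0
1 0 5632 184588 221948 3113308 0 0 0 96 139 187561 35 10 55 0
3 0 5632 184588 221948 3113308 0 0 0 0 114 241731 38 10 52 0
3 0 5632 184920 221948 3113308 0 0 0 0 132 257168 40 9 51 0
1 0 5632 184912 221948 3113308 0 0 0 0 114 251802 38 9 54 0
"""


def mean_context_switches(vmstat_text, skip_first=True):
    """Return the mean of the 'cs' column in captured `vmstat 1` output."""
    cs_index = None
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Locate the 'cs' column from the header row (it starts with 'r').
        if fields[0] == "r" and "cs" in fields:
            cs_index = fields.index("cs")
            continue
        # Ignore anything that is not a numeric data row under a known header.
        if cs_index is None or not fields[0].isdigit() or len(fields) <= cs_index:
            continue
        samples.append(int(fields[cs_index]))
    if skip_first:
        # The first vmstat row is not a one-second sample, so drop it.
        samples = samples[1:]
    if not samples:
        raise ValueError("no vmstat samples found")
    return sum(samples) / len(samples)


if __name__ == "__main__":
    print(f"mean cs/sec: {mean_context_switches(SAMPLE):.0f}")
```

Saving the output of a run with "vmstat 1 > vmstat.log" and passing the file contents to the function should work the same way, as long as the header row is present.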

> Note the test case assumes you've got shared_buffers set to at least
> 1000; with smaller values, you may get some I/O syscalls, which will
> probably skew the results.

shared_buffers
----------------
16384
(1 row)

I found that killing three of the four concurrent queries dropped
context switches to about 70,000 to 100,000 per second. Two or more
sessions bring it back up to 200K+.

Joe
