Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: grb(at)skogoglandskap(dot)no, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)
Date: 2015-07-08 12:55:12
Message-ID: 20150708125512.GL10242@alap3.anarazel.de
Lists: pgsql-bugs

On 2015-07-08 11:12:38 +0200, Andres Freund wrote:
> On 2015-07-07 21:13:04 -0400, Tom Lane wrote:
> > There is some discussion going on about improving the scalability of
> > snapshot acquisition, but nothing will happen in that line before 9.6
> > at the earliest.
>
> 9.5 should be less bad at it than 9.4, at least if it's mostly read-only
> ProcArrayLock acquisitions, which sounds like it should be the case here.

test 3:
master:
1 clients: 3112.7
2 clients: 6806.7
4 clients: 13441.2
8 clients: 15765.4
16 clients: 21102.2

9.4:
1 clients: 2524.2
2 clients: 5903.2
4 clients: 11756.8
8 clients: 14583.3
16 clients: 19309.2
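
In case it matters, these are throughput figures from pgbench-style runs with
a custom script, one run per client count; a rough sketch, where flags and the
file name are illustrative rather than the literal invocation:

    pgbench -n -c 8 -j 8 -T 30 -f test3.sql postgres

with test3.sql simply calling the pl/pgsql function from the bug report and
-c/-j varied from 1 to 16.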

So there's an interesting "dip" between 4 and 8 clients: throughput goes up
by less than 20% instead of doubling. A perf profile doesn't show any actual
lock contention on master. Not that surprising; there shouldn't be any
exclusive locks here.

One interesting thing to consider in exactly such cases is Intel's Turbo
Boost. Disabling it (echo 0 > /sys/devices/system/cpu/cpufreq/boost) gives
us these results:
test 3:
master:
1 clients: 2926.6
2 clients: 6634.3
4 clients: 13905.2
8 clients: 15718.9

So that's not it in this case.
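
For reference, the knobs involved (the intel_pstate line is the inverted
equivalent on boxes using that driver instead of acpi-cpufreq):

    cat /sys/devices/system/cpu/cpufreq/boost                # 1 = boost enabled
    echo 0 > /sys/devices/system/cpu/cpufreq/boost           # disable boost
    echo 1 > /sys/devices/system/cpu/cpufreq/boost           # re-enable boost
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo   # intel_pstate: 1 = turbo off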

Comparing stats between the 4- and 8-client runs shows the following (boring data removed):

4 clients:
90859.517328 task-clock (msec) # 3.428 CPUs utilized
109,655,985,749 stalled-cycles-frontend # 54.27% frontend cycles idle (27.79%)
62,906,918,008 stalled-cycles-backend # 31.14% backend cycles idle (27.78%)
219,063,494,214 instructions # 1.08 insns per cycle
# 0.50 stalled cycles per insn (33.32%)
41,664,400,828 branches # 458.558 M/sec (33.32%)
374,426,805 branch-misses # 0.90% of all branches (33.32%)
62,504,845,665 L1-dcache-loads # 687.928 M/sec (27.78%)
1,224,842,848 L1-dcache-load-misses # 1.96% of all L1-dcache hits (27.81%)
321,981,924 LLC-loads # 3.544 M/sec (22.33%)
23,219,438 LLC-load-misses # 7.21% of all LL-cache hits (5.52%)

26.507528305 seconds time elapsed

8 clients:
165168.247631 task-clock (msec) # 6.824 CPUs utilized
247,231,674,170 stalled-cycles-frontend # 67.04% frontend cycles idle (27.84%)
101,354,900,788 stalled-cycles-backend # 27.48% backend cycles idle (27.83%)
285,829,642,649 instructions # 0.78 insns per cycle
# 0.86 stalled cycles per insn (33.39%)
54,503,992,461 branches # 329.991 M/sec (33.39%)
761,911,056 branch-misses # 1.40% of all branches (33.38%)
81,373,091,784 L1-dcache-loads # 492.668 M/sec (27.74%)
4,419,307,036 L1-dcache-load-misses # 5.43% of all L1-dcache hits (27.72%)
510,940,577 LLC-loads # 3.093 M/sec (21.86%)
26,963,120 LLC-load-misses # 5.28% of all LL-cache hits (5.37%)

24.205675255 seconds time elapsed

It's quite visible that all caches have considerably worse characteristics
in the 8-client case, and that "instructions per cycle" has dropped markedly.
Presumably that's because more frontend cycles were idle, which in turn is
probably caused by the higher cache miss ratios. L1 going from 1.96% misses
to 5.43% misses is quite a drastic difference.
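
One way to gather counters like these is to attach perf stat to the running
backends for the duration of a run; a rough sketch, not the literal command
line:

    perf stat -p "$(pgrep -d, -x postgres)" \
        -e task-clock,instructions,branches,branch-misses \
        -e stalled-cycles-frontend,stalled-cycles-backend \
        -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses \
        -- sleep 25

The percentages in parentheses are just perf multiplexing that many events
over the available hardware counters.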

Now, looking at where cache misses happen:
4 clients:
+ 7.64% postgres postgres [.] AllocSetAlloc
+ 3.90% postgres postgres [.] LWLockAcquire
+ 3.40% postgres plpgsql.so [.] plpgsql_exec_function
+ 2.64% postgres postgres [.] GetCachedPlan
+ 2.20% postgres postgres [.] slot_deform_tuple
+ 2.16% postgres libc-2.19.so [.] _int_free
+ 2.08% postgres libc-2.19.so [.] __memcpy_sse2_unaligned

8 clients:
+ 6.34% postgres postgres [.] AllocSetAlloc
+ 4.89% postgres plpgsql.so [.] plpgsql_exec_function
+ 2.63% postgres libc-2.19.so [.] _int_free
+ 2.60% postgres libc-2.19.so [.] __memcpy_sse2_unaligned
+ 2.50% postgres postgres [.] ExecLimit
+ 2.47% postgres postgres [.] LWLockAcquire
+ 2.18% postgres postgres [.] ExecProject

So, interestingly, the profile changes quite a bit between 4 and 8 clients.
I reproduced this a number of times to make sure it's not just a transient
issue.
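
A profile like this can be captured by sampling on a cache-miss event with
call graphs enabled while the benchmark runs; the exact event and options
below are a guess:

    perf record -a -g -e L1-dcache-load-misses -- sleep 25
    perf report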

The rise in memcpy time is mainly from:
+ 80.27% SearchCatCache
+ 10.56% appendBinaryStringInfo
+ 6.51% socket_putmessage
+ 0.78% pgstat_report_activity

So, at least on the hardware available to me right now, this isn't caused
by actual lock contention.

Hm. I've a patch addressing the SearchCatCache memcpy() cost
somewhere...

Andres
