Re: pgbench - implement strict TPC-B benchmark

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pgbench - implement strict TPC-B benchmark
Date: 2019-08-02 08:34:24
Message-ID: alpine.DEB.2.21.1908012320430.32558@lancre
Lists: pgsql-hackers


Hello Andres,

Thanks a lot for this feedback and these comments.

> Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1
> e.g. shows pgbench to use 189% CPU in my 4/8 core/thread laptop. That's
> a pretty significant share.

Fine, but what is the corresponding server load? 211%? 611%? And what
actual time is spent in pgbench itself, vs libpq and syscalls?

Figures and discussion below.

> And before you argue that that's just about a read-only workload:

I'm fine with worst-case scenarios:-) Let's do the worst on my 2-core
laptop running at 2.2 GHz:

(0) we can run a script that does nearly nothing:

sh> cat nope.sql
\sleep 0
# do not sleep, so stay awake…

sh> time pgbench -f nope.sql -T 10 -r
latency average = 0.000 ms
tps = 12569499.226367 (excluding connections establishing) # 12.6M
statement latencies in milliseconds:
0.000 \sleep 0
real 0m10.072s, user 0m10.027s, sys 0m0.012s

Unsurprisingly, pgbench is at about 100% cpu load, and the transaction cost
(transaction loop and stat collection) is about 0.080 µs (1/12.6M) per
script execution (one client on one thread).
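
(As a quick sanity check of that figure, just inverting the tps; plain bc,
nothing pgbench-specific:)

sh> echo "1000000 / 12569499" | bc -l
.07955...  # µs per script execution, i.e. ~0.080 µs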

(1) a pgbench complex-commands-only script:

sh> cat set.sql
\set x random_exponential(1, :scale * 10, 2.5) + 2.1
\set y random(1, 9) + 17.1 * :x
\set z case when :x > 7 then 1.0 / ln(:y) else 2.0 / sqrt(:y) end

sh> time pgbench -f set.sql -T 10 -r
latency average = 0.001 ms
tps = 1304989.729560 (excluding connections establishing) # 1.3M
statement latencies in milliseconds:
0.000 \set x random_exponential(1, :scale * 10, 2.5) + 2.1
0.000 \set y random(1, 9) + 17.1 * :x
0.000 \set z case when :x > 7 then 1.0 / ln(:y) else 2.0 / sqrt(:y) end
real 0m10.038s, user 0m10.003s, sys 0m0.000s

Again pgbench load is near 100%, with pgbench-only work (thread loop,
expression evaluation, variables, stat collection) costing about 0.766 µs
of cpu per script execution. This is about 10 times the previous case, so
roughly 90% of pgbench's cpu cost here is in expressions and variables,
which is no surprise.
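
(Same sanity check for this case:)

sh> echo "1000000 / 1304990" | bc -l
.76628...  # µs per script execution, i.e. ~0.766 µs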

This under-a-µs cost could probably be reduced… but what overall
improvement would it provide? The last test gives an answer:

(2) a ridiculously small SQL query, tested through a local unix socket:

sh> cat empty.sql
;
# yep, an empty query!

sh> time pgbench -f empty.sql -T 10 -r
latency average = 0.016 ms
tps = 62206.501709 (excluding connections establishing) # 62.2K
statement latencies in milliseconds:
0.016 ;
real 0m10.038s, user 0m1.754s, sys 0m3.867s

Here we add minimal libpq and underlying system-call work to pgbench's
internal cpu costs, with the most favorable (or worst:-) SQL query over the
most favorable postgres connection.
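
(For the record, the socket connection can be forced explicitly; the socket
directory below is an assumption, it depends on the installation:)

sh> pgbench -h /var/run/postgresql -f empty.sql -T 10 -r
# -h with a directory path selects a Unix-domain socket connection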

Apparent load is about (1.754+3.867)/10.038 = 56%, so the cpu cost per
script is 0.56 / 62206.5 = 9 µs, over 100 times the cost of a do-nothing
script (0), and over 10 times the cost of a complex expression command
script (1).
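
(Checking those two numbers with bc:)

sh> echo "(1.754 + 3.867) / 10.038" | bc -l
.55997...  # ~56% apparent cpu load
sh> echo ".56 / 62206.5 * 1000000" | bc -l
9.00227... # ~9 µs of cpu per script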

Conclusion: pgbench-specific overheads are typically (much) below 10% of
the total client-side cpu cost of a transaction, while over 90% of the cpu
cost is spent in libpq and the system, even for this worst-case do-nothing
query.

A perfect bench driver with zero overhead would reduce the client cpu cost
by at most 10%, because you still have to talk to the database through the
system. If the pgbench cost were divided by two, which would be a
reasonable achievement, the benchmark client cost would be reduced by about
5%.
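
(To make that arithmetic explicit, comparing the pgbench-side cost measured
in (1) with the total client cost measured in (2):)

sh> echo "0.766 / 9" | bc -l
.08511...  # pgbench share: ~8.5%, below 10%
sh> echo "0.766 / 2 / 9" | bc -l
.04255...  # saving if pgbench cost were halved: ~4-5%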

Wow?

In the past I have already given some thought to optimizing "pgbench",
especially to avoiding long switches (e.g. in expression evaluation) and
maybe improving variable management, but as shown above I would not expect
a gain worth the effort, and I assume such a patch would probably be justly
rejected, because for a realistic benchmark script these costs are already
much smaller than the inevitable libpq/syscall costs.

That does not mean that nothing needs to be done, but the situation is
currently quite good.

In conclusion, ISTM that current pgbench makes it possible to saturate a
postgres server from a client significantly smaller than the server, which
seems like a reasonable benchmarking situation. Any other driver in any
other language would necessarily incur the same kinds of costs.
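
(For instance, something like the following run from a separate client
host; the host name, client/thread counts and duration are mere
placeholders to be sized against the server:)

sh> pgbench -h dbserver -M prepared -c 64 -j 8 -T 300 -P 10 bench_db
# 64 clients over 8 threads, 5 minutes, progress reported every 10 s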

> [...] And the largest part of the overhead is in pgbench's interpreter
> loop:

Indeed, the figures below are very interesting! Thanks for collecting
them.

> + 12.35% pgbench pgbench [.] threadRun
> + 3.54% pgbench pgbench [.] dopr.constprop.0
> + 3.30% pgbench pgbench [.] fmtint
> + 1.93% pgbench pgbench [.] getVariable

~ 21%, probably some inlining has been performed, because I would have
expected to see significant time in "advanceConnectionState".

> + 2.95% pgbench libpq.so.5.13 [.] PQsendQueryPrepared
> + 2.15% pgbench libpq.so.5.13 [.] pqPutInt
> + 4.47% pgbench libpq.so.5.13 [.] pqParseInput3
> + 1.66% pgbench libpq.so.5.13 [.] pqPutMsgStart
> + 1.63% pgbench libpq.so.5.13 [.] pqGetInt

~ 13%

> + 3.16% pgbench libc-2.28.so [.] __strcmp_avx2
> + 2.95% pgbench libc-2.28.so [.] malloc
> + 1.85% pgbench libc-2.28.so [.] ppoll
> + 1.85% pgbench libc-2.28.so [.] __strlen_avx2
> + 1.85% pgbench libpthread-2.28.so [.] __libc_recv

~ 11%; string handling is a pain… Not sure who is calling, though: pgbench or libpq.

This is basically 47% pgbench, 53% lib*, on the sample provided. I'm
unclear about where system time is measured.
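
(For reference, I guess such a profile is collected with something like the
following; the pid lookup and duration are assumptions. Kernel time would
show up as [k] entries in the report, which the sample above does not
include:)

sh> perf record -g -p $(pgrep -n pgbench) -- sleep 10
sh> perf report --sort dso,symbol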

> And that's just the standard pgbench read/write case, without
> additional script commands or anything.

> Well, duh, that's because you're completely IO bound. You're doing
> 400tps. That's *nothing*. All you're measuring is how fast the WAL can
> be fdatasync()ed to disk. Of *course* pgbench isn't a relevant overhead
> in that case. I really don't understand how this can be an argument.

Sure. My interest in running it was to show that the \set stuff is
ridiculously cheap compared to processing an actual SQL query, but it does
not allow analyzing all the overheads. I hope the 3 examples above make my
point more understandable.

>> Also, pgbench overheads must be compared to an actual client application,
>> which deals with a database through some language (PHP, Python, JS, Java…)
>> the interpreter of which would be written in C/C++ just like pgbench, and
>> some library (ORM, DBI, JDBC…), possibly written in the initial language and
>> relying on libpq under the hood. Ok, there could be some JIT involved, but
>> it will not change that there are costs there too, and it would have to do
>> pretty much the same things that pgbench is doing, plus what the application
>> has to do with the data.
>
> Uh, but those clients aren't all running on a single machine.

Sure.

The cumulative power of the clients is probably much larger than that of
the postgres server itself, and ISTM that pgbench allows simulating such
loads with much smaller client-side requirements, and that any other tool
could not do much better.

--
Fabien.
