Re: pgbench - implement strict TPC-B benchmark

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pgbench - implement strict TPC-B benchmark
Date: 2019-08-03 09:30:30
Message-ID: alpine.DEB.2.21.1908030845550.24235@lancre
Lists: pgsql-hackers


Hello Andres,

>>> Using pgbench -Mprepared -n -c 8 -j 8 -S pgbench_100 -T 10 -r -P1
>>> e.g. shows pgbench to use 189% CPU in my 4/8 core/thread laptop. That's
>>> a pretty significant share.
>>
>> Fine, but what is the corresponding server load? 211%? 611%? And what actual
>> time is spent in pgbench itself, vs libpq and syscalls?
>
> System wide pgbench, including libpq, is about 22% of the whole system.

Hmmm. I guess that the 189% CPU on 4 cores/8 threads and the 22% overall
load are consistent because 189/800 = 23.6% ~ 22%.

Given the simplicity of the select-only transaction, the stuff is CPU
bound, so the 8 postgres server processes should saturate the 4 core CPU, and
pgbench & postgres are competing for CPU time. The overall load is
probably 100%, i.e. 22% pgbench vs 78% postgres (assuming system is
included), 78/22 = 3.5, i.e. pgbench on one core would saturate postgres
on 3.5 cores on a CPU bound load.
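
The arithmetic above can be checked with a short sketch. The figures are
the ones quoted in this message; the fully loaded machine (100% overall)
is an assumption, not a measurement, and the variable names are only
illustrative.

```python
# Figures quoted in the thread; the 100% overall load is an assumption.
pgbench_cpu_percent = 189.0            # pgbench CPU, summed over hardware threads
hw_threads = 8                         # 4 cores / 8 threads laptop
capacity_percent = 100.0 * hw_threads  # 800% when every thread is busy

pgbench_share = pgbench_cpu_percent / capacity_percent
print(f"pgbench share of the machine: {pgbench_share:.1%}")

# Assuming a saturated machine, postgres (system included) gets the rest.
measured_pgbench_share = 0.22
postgres_share = 1.0 - measured_pgbench_share
ratio = postgres_share / measured_pgbench_share
print(f"postgres/pgbench CPU ratio: {ratio:.2f}")
```

The ratio only holds if the machine really is saturated; with idle time
left over, the postgres share would be smaller than 78%.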

I'm not shocked by these results for near worst-case conditions (i.e. the
server side has very little to do).

It seems quite consistent with the really worst-case example I reported
(empty query, cannot do less). Looking at the same empty-sql-query load
through "htop", I have 95% postgres and 75% pgbench. This is not fully
consistent with "time", which reports 55% pgbench overall: over 2/3 of
that is system time, and the remaining under 1/3 is pgbench user time,
which must be divided between pgbench's actual code and external
libpq/lib* other stuff.

Yet again, pgbench code is not the issue from my point of view, because
time is spent mostly elsewhere and any other driver would have to do the
same.

> As far as I can tell there's a number of things that are wrong:

Sure, I agree that things could be improved.

> - prepared statement names are recomputed for every query execution

I'm not sure it is a big issue; it should be precomputed somewhere,
though.

> - variable name lookup is done for every command, rather than once, when
> parsing commands

Hmmm. The names of variables are not all known in advance, e.g. \gset.
Possibly it does not matter, because the names of the variables actually
used are known. Each used variable could get a number, so that using a
variable would be accessing an array at the corresponding index.
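
A minimal sketch of that numbering idea follows. It is not pgbench code:
the class, the function names and the whitespace-based tokenizer are all
hypothetical, and the real implementation would be in C. It only shows
name lookup being done once at parse time, so that execution is plain
array indexing.

```python
# Hypothetical sketch: names and structure are illustrative, not pgbench's code.
class VariableTable:
    def __init__(self):
        self.slot_of = {}   # variable name -> slot index, filled at parse time
        self.values = []    # slot index -> current value, used at run time

    def slot(self, name):
        """Return the slot for name, allocating one the first time it is seen."""
        if name not in self.slot_of:
            self.slot_of[name] = len(self.values)
            self.values.append(None)
        return self.slot_of[name]


def compile_command(table, text):
    """Parse once: replace each :name token by its slot index."""
    parts = []
    for token in text.split():
        if token.startswith(":"):
            parts.append(("var", table.slot(token[1:])))
        else:
            parts.append(("lit", token))
    return parts


def run_command(table, parts):
    """Execute many times: array indexing only, no string comparison."""
    return " ".join(
        str(table.values[arg]) if kind == "var" else arg
        for kind, arg in parts
    )


table = VariableTable()
cmd = compile_command(
    table, "SELECT abalance FROM pgbench_accounts WHERE aid = :aid")
table.values[table.slot("aid")] = 42
print(run_command(table, cmd))
```

A variable set at run time, as with \gset, would still get its slot at
parse time in this scheme, since its name is written in the script.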

> - a lot of string->int->string type back and forths

Yep, that is a pain. ISTM that strings are exchanged at the protocol
level, but this is libpq design, not pgbench.

As far as variable values are concerned, AFAICR conversions are performed
on demand only, and just once.

Overall, my point is that even if all pgbench-specific costs were wiped
out it would not change the final result (pgbench load) much because most
of the time is spent in libpq and system. Any other test driver would
incur the same cost.

>> Conclusion: pgbench-specific overheads are typically (much) below 10% of the
>> total client-side cpu cost of a transaction, while over 90% of the cpu cost
>> is spent in libpq and system, for the worst case do-nothing query.
>
> I don't buy that that's the actual worst case, or even remotely close to
> it.

Hmmm. I'm not sure I can do much worse than 3 complex expressions against
one empty sql query. Ok, I could put 27 complex expressions to reach
50-50, but the 3-to-1 complex-expression-to-empty-sql ratio already seems
ok for a realistic worst-case test script.

> I e.g. see higher pgbench overhead for the *modify* case than for
> the pgbench's readonly case. And that's because some of the meta
> commands are slow, in particular everything related to variables. And
> the modify case just has more variables.

Hmmm. WRT \set and expressions, the two main costs seem to be the large
switch and the variable management. Yet again, I still interpret the
figures I collected as showing that these costs are small compared to
libpq/system overheads, and that the overall result is below postgres'
own CPU costs (on a per client basis).

>>> + 12.35% pgbench pgbench [.] threadRun
>>> + 3.54% pgbench pgbench [.] dopr.constprop.0
>>
>> ~ 21%, probably some inlining has been performed, because I would have
>> expected to see significant time in "advanceConnectionState".
>
> Yea, there's plenty inlining. Note dopr() is string processing.

Which is a pain, no doubt about that. Some of it has been taken out of
pgbench already, e.g. by using an enum instead of comparing commands.

>>> + 2.95% pgbench libpq.so.5.13 [.] PQsendQueryPrepared
>>> + 2.15% pgbench libpq.so.5.13 [.] pqPutInt
>>> + 4.47% pgbench libpq.so.5.13 [.] pqParseInput3
>>> + 1.66% pgbench libpq.so.5.13 [.] pqPutMsgStart
>>> + 1.63% pgbench libpq.so.5.13 [.] pqGetInt
>>
>> ~ 13%
>
> A lot of that is really stupid. We need to improve libpq.
> PqsendQueryGuts (attributed to PQsendQueryPrepared here), builds the
> command in many separate pqPut* commands, which reside in another
> translation unit, is pretty sad.

Indeed, I'm definitely convinced that libpq costs are high and should be
reduced where possible. Now, yet again, they are much smaller than the
time spent in the system to send and receive the data on a local socket,
so somehow they could be interpreted as good enough, even if not that
good.

>>> + 3.16% pgbench libc-2.28.so [.] __strcmp_avx2
>>> + 2.95% pgbench libc-2.28.so [.] malloc
>>> + 1.85% pgbench libc-2.28.so [.] ppoll
>>> + 1.85% pgbench libc-2.28.so [.] __strlen_avx2
>>> + 1.85% pgbench libpthread-2.28.so [.] __libc_recv
>>
>> ~ 11%, str is a pain… Not sure who is calling though, pgbench or
>> libpq.
>
> Both. Most of the strcmp is from getQueryParams()/getVariable(). The
> dopr() is from pg_*printf, which is mostly attributable to
> preparedStatementName() and getVariable().

Hmmm. Frankly, I can optimize pgbench code pretty easily, but I'm not sure
about maintainability and, as I said many times, about the real effect it
would have, because these costs are a minor part of the client side of
the benchmark.

>> This is basically 47% pgbench, 53% lib*, on the sample provided. I'm unclear
>> about where system time is measured.
>
> It was excluded in this profile, both to reduce profiling costs, and to
> focus on pgbench.

Ok.

If we take my other figures and round up, for a running pgbench we have
1/6 actual pgbench, 1/6 libpq, 2/3 system.

If I get a factor of 10 speedup in actual pgbench (let us assume I'm that
good:-), then the overall gain is 1/6 - 1/6/10 = 15%. I can do it, and it
would be some fun, but the code would get ugly (not too bad, but
nevertheless probably less maintainable, with a partial typing phase and
expression compilation), and my bet is that however good the patch, it
would be rejected.
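
The 15% figure can be reproduced with exact fractions. This is only a
sketch of the estimate: the 1/6, 1/6, 2/3 breakdown is the rounded one
given above, and the factor of 10 is the hypothetical speedup.

```python
from fractions import Fraction

# Rounded breakdown from this message: 1/6 pgbench, 1/6 libpq, 2/3 system.
pgbench, libpq, system = Fraction(1, 6), Fraction(1, 6), Fraction(2, 3)
assert pgbench + libpq + system == 1

speedup = 10  # hypothetical speedup of pgbench's own code only

# Only the pgbench part shrinks; libpq and system time are unchanged.
new_total = pgbench / speedup + libpq + system
saved = 1 - new_total

print(f"client CPU time saved: {float(saved):.0%}")
print(f"overall client-side speedup: {float(1 / new_total):.2f}x")
```

Even an infinite speedup of the pgbench part would save at most 1/6 of
the client-side time, since the other 5/6 is untouched.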

Do you see an error in my evaluation of pgbench actual costs and its
contribution to the overall performance of running a benchmark?

If yes, which is it?

If not, do you think it advisable to spend time improving the evaluator &
variable stuff and possibly other places for an overall 15% gain?

Also, what would be the likelihood of such an optimization patch passing?

I could do a limited variable management improvement patch eventually; I
have funny ideas to speed up the thing, some of which are outlined above,
some others even more terrible.

--
Fabien.
