Re: [HACKERS] pgbench regression test failure

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Steve Singer <steve(at)ssinger(dot)info>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] pgbench regression test failure
Date: 2017-11-19 21:11:50
Message-ID: 28977.1511125910@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> writes:
> Here is the hopefully right version, which passes tests here.

I took a look at this. What I get for the "sleep-100" test case posted in
https://postgr.es/m/alpine.DEB.2.20.1709240820160.4999@lancre
with current HEAD is

progress: 1.0 s, 16.0 tps, lat 37.746 ms stddev 48.730, lag 0.039 ms, 10 skipped
progress: 2.0 s, 26.0 tps, lat 27.115 ms stddev 44.672, lag 0.068 ms, 19 skipped
progress: 3.0 s, 25.0 tps, lat 32.184 ms stddev 46.915, lag 0.034 ms, 17 skipped
progress: 4.0 s, 22.0 tps, lat 22.868 ms stddev 42.167, lag 0.033 ms, 17 skipped
transaction type: sleep-100.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100
number of transactions actually processed: 29/100
number of transactions skipped: 71 (71.000 %)
number of transactions above the 1.0 ms latency limit: 29 (29.000 %)
latency average = 100.636 ms
latency stddev = 0.185 ms
rate limit schedule lag: avg 0.044 (max 0.956) ms
tps = 22.467484 (including connections establishing)
tps = 22.488151 (excluding connections establishing)
script statistics:
- number of transactions skipped: 0 (-nan%)

The patch changes that to

progress: 1.0 s, 6.0 tps, lat 100.658 ms stddev 0.193, lag 0.101 ms, 10 skipped
progress: 2.0 s, 8.0 tps, lat 100.684 ms stddev 0.264, lag 0.208 ms, 14 skipped
progress: 3.0 s, 8.0 tps, lat 100.576 ms stddev 0.026, lag 0.101 ms, 22 skipped
progress: 4.0 s, 6.0 tps, lat 100.542 ms stddev 0.080, lag 0.102 ms, 10 skipped
transaction type: sleep-100.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 100
number of transactions actually processed: 33/100
number of transactions skipped: 67 (67.000 %)
number of transactions above the 1.0 ms latency limit: 33 (33.000 %)
latency average = 100.613 ms
latency stddev = 0.169 ms
rate limit schedule lag: avg 0.043 (max 0.921) ms
tps = 7.138174 (including connections establishing)
tps = 7.144332 (excluding connections establishing)
script statistics:
- number of transactions skipped: 67 (67.000%)

So we fixed the reported TPS rate, which was nowhere near reality,
and the per-script stats are sane now. Good so far, but this
still has two problems IMO:

1. The per-script stats shouldn't be printed at all if there's
only one script. They're redundant with the overall stats.

2. ISTM that we should report that 100% of the transactions were
above the latency limit, not 33%; that is, the appropriate base
for the "number of transactions above the latency limit" percentage
is the number of actual transactions not the number of scheduled
transactions.

I also noticed that if I specify "-f sleep-100.sql" more than once,
the per-script TPS reports are out of line. This is evidently because
that calculation isn't excluding skipped xacts; but if we're going to
define tps as excluding skipped xacts, surely we should do so there too.

I'm also still exceedingly unhappy about the NaN business.
You have this comment in printSimpleStats:
/* print NaN if no transactions where executed */
but I find that unduly optimistic. It should really read more like
"if no transactions were executed, at best we'll get some platform-
dependent spelling of NaN. At worst we'll get a SIGFPE." I do not think
that a pedantic argument about NaN being the "correct" answer justifies
taking such portability risks, even if I bought that argument, which
I don't particularly. It's also inconsistent with this basic decision in
printResults:

/* Remaining stats are nonsensical if we failed to execute any xacts */
if (total->cnt <= 0)
return;

If we thought NaN was the "correct" answer for no xacts then we'd just
bull ahead and print NaNs all over the place. I think we would be best
advised to deal with this by printing zeroes in the progress-report case,
and by skipping the output altogether in printSimpleStats. (Or we could
make printSimpleStats print zeroes; I'm indifferent to that choice.)

Attached is an updated patch addressing these points. Comments?

regards, tom lane

Attachment Content-Type Size
pgbench-progress-skip-3.patch text/x-diff 7.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2017-11-19 21:24:00 Re: [HACKERS] [PATCH] Incremental sort
Previous Message jotpe 2017-11-19 20:17:46 Re: percentile value check can be slow