Re: [HACKERS] pgbench regression test failure

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Steve Singer <steve(at)ssinger(dot)info>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] pgbench regression test failure
Date: 2017-11-20 08:45:58
Message-ID: alpine.DEB.2.20.1711200703240.11926@lancre
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Tom,

Thanks for having a look at this bug fix.

> So we fixed the reported TPS rate, which was nowhere near reality,
> and the per-script stats are sane now. Good so far, but this
> still has two problems IMO:
>
> 1. The per-script stats shouldn't be printed at all if there's
> only one script. They're redundant with the overall stats.

Indeed.

I think the output should tend to be the same for possible automatic
processing, whether there is one script or more, even at the price of some
redundancy.

Obviously this is highly debatable.

> 2. ISTM that we should report that 100% of the transactions were
> above the latency limit, not 33%; that is, the appropriate base
> for the "number of transactions above the latency limit" percentage
> is the number of actual transactions not the number of scheduled
> transactions.

Hmmm. Allow me to disagree.

If the user asked for something, then this is the reference to take and
the whole report should be relative to that. Also, changing this makes the
report figures not adding up:

before:
number of transactions skipped: 417 (90.260 %)
number of transactions above the 1.0 ms latency limit: 45 (9.740 %)

Both above percent are on the same reference, fine. Can be added, 100% of
transactions we required had issues.

after:
number of transactions skipped: 457 (90.855 %)
number of transactions above the 1.0 ms latency limit: 46 (100.000 %)

Percents do not add up anymore. I do not think that this is an
improvement.

> I also noticed that if I specify "-f sleep-100.sql" more than once,
> the per-script TPS reports are out of line. This is evidently because
> that calculation isn't excluding skipped xacts; but if we're going to
> define tps as excluding skipped xacts, surely we should do so there too.

I do not think that we should exclude skipped xacts.

> I'm also still exceedingly unhappy about the NaN business.
> You have this comment in printSimpleStats:
> /* print NaN if no transactions where executed */
> but I find that unduly optimistic. It should really read more like
> "if no transactions were executed, at best we'll get some platform-
> dependent spelling of NaN. At worst we'll get a SIGFPE."

Hmmm. Alas you must be right about spelling. There has been no report of
SIGFPE issue, so I would not bother with that.

> I do not think that a pedantic argument about NaN being the "correct"
> answer justifies taking such portability risks, even if I bought that
> argument, which I don't particularly.

ISTM that the portability risk, i.e. about SIGFPE issue, would require a
non IEEE 754 conforming box, and there has been no complaint yet about
that.

> It's also inconsistent with this basic decision in printResults:
>
> /* Remaining stats are nonsensical if we failed to execute any xacts */
> if (total->cnt <= 0)
> return;

According to praise, this decision was by Tom Lane. It is about the whole
execution not executing anything, which may be seen as a special error
case, whereas a progress report without xact execution within some period
can be perfectly normal.

> If we thought NaN was the "correct" answer for no xacts then we'd just
> bull ahead and print NaNs all over the place.

Why not, but for me the issue wrt to the final report is distinct.

> I think we would be best advised to deal with this by printing zeroes in
> the progress-report case, and by skipping the output altogether in
> printSimpleStats. (Or we could make printSimpleStats print zeroes; I'm
> indifferent to that choice.)

I disagree with printing a zero if the measure is undefined. Zero does not
mean undefined. "Zero means zero":-)

Maybe we could replace platform-specific printing for "NaN" with something
else, either a deterministic "NaN" or some other string, eg "-" or "?" or
whatever. The benefit with sticking to some "NaN" or some platform
specific NaN is that processing the output can retrieve it as a float.

So basically I would have stayed with the current version which seems
consistent to me, but I agree that it is debatable. I'm not sure it is
worth being "exceedingly unhappy" about it, because these are special
corner cases anyway, not likely to be seen without somehow looking for it
or having pretty strange pgbench runs.

> Attached is an updated patch addressing these points. Comments?

I somehow disagree with your points above, for various reasons.

However, you are a committer, and I think best that the bug is fixed even
if I happen to disagree with unrelated changes. As an academic, I probably
have a number of silly ideas about a lot of things. This is part of the
job description:-)

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ildar Musin 2017-11-20 08:53:38 Re: [HACKERS] Repetitive code in RI triggers
Previous Message Ashutosh Bapat 2017-11-20 08:38:22 Re: [HACKERS] Proposal: Local indexes for partitioned table