Re: pgbench rate limiting changes transaction latency computation

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pgbench rate limiting changes transaction latency computation
Date: 2019-06-11 08:31:15
Message-ID: 8b2d197d-9d3e-5eb5-ab9c-6ef52ac72edd@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/06/2019 10:12, Andres Freund wrote:
> On 2019-06-11 08:36:55 +0200, Fabien COELHO wrote:
>> This behavior under -R is fully voluntary, and the result above just show
>> that the database cannot really keep up with the load, which is simply the
>> case, so for me it is okay to show bad figures.
>
> I mean, you just turned one named value, into a different one, without
> renaming it. And the new meaning under -R, is basically the same as one
> that's already there (lag). Also note that it also can actually keep up
> in the above example.

It's not fair to say that its meaning was changed. Before 9.4, there was
no -R option. As Fabien said, the reported latency is the latency seen
by the imaginary user of the system, and without -R, there's no lag so
it's the same number. See also how it works with --latency-limit. The
limit is on the reported latency, which includes the lag.

Yeah, I can see that the server-observed transaction latency would often
be more useful than what's printed now. But changing the current meaning
doesn't seem like a good idea.

>> The idea under throttling is to model a client which would want the result
>> of a query at a certain point in time, say a query for a web page which is
>> being generated, which is the scheduled time. It is the when the client
>> knows it wants an answer. If it is not processed immediately, that is bad
>> for its client perceived latency.
>
>> Whether this is due to lag (i.e. the server is loaded and cannot start to
>> process the answer) or because the server is slow to answer is irrelevant,
>> the client is waiting, the web page is not generated, the system is slow. So
>> latency under -R is really "client latency", not only query latency, as it
>> is documented.
>
> What does that have to do with incorporating the same data into both lag
> and latency? I just fail to see what the point is, except to make it
> unnecessarily harder to compare postgres' behaviour under both a
> throttled and push-it-to-the-breaking point loads.
>
> How long individual transactions take, and how much variance there is in
> that, is something *crucial* to optimize for. *Especially* when the
> machine/load is provisioned in a way to not overload the machine.
>
> How is e.g.
> progress: 1.6 s, 0.0 tps, lat 0.000 ms stddev 0.000, lag 0.000 ms
> progress: 2.0 s, 103546.5 tps, lat 1584.161 ms stddev 35.589, lag 1582.043 ms
> progress: 3.0 s, 108535.2 tps, lat 1347.619 ms stddev 101.782, lag 1346.170 ms
> progress: 4.0 s, 108528.8 tps, lat 996.603 ms stddev 106.052, lag 995.159 ms
> progress: 5.0 s, 109468.8 tps, lat 633.464 ms stddev 108.483, lag 632.030 ms
> progress: 6.0 s, 110606.7 tps, lat 252.923 ms stddev 110.391, lag 251.505 ms
> progress: 7.0 s, 84253.3 tps, lat 6.829 ms stddev 15.067, lag 6.423 ms
> progress: 8.0 s, 80470.7 tps, lat 0.142 ms stddev 0.079, lag 0.017 ms
> progress: 9.0 s, 80104.2 tps, lat 0.142 ms stddev 0.081, lag 0.017 ms
> progress: 10.0 s, 80277.0 tps, lat 0.152 ms stddev 0.150, lag 0.017 ms
>
> the lat column adds basically nothing over the lag column here.
>
> more useful than:
> progress: 1.3 s, 0.0 tps, lat 0.000 ms stddev 0.000, lag 0.000 ms
> progress: 2.0 s, 116315.6 tps, lat 1.425 ms stddev 1.440, lag 1087.076 ms
> progress: 3.0 s, 113526.2 tps, lat 1.390 ms stddev 0.408, lag 709.908 ms
> progress: 4.0 s, 111816.4 tps, lat 1.407 ms stddev 0.399, lag 302.866 ms
> progress: 5.0 s, 88061.9 tps, lat 0.543 ms stddev 0.652, lag 16.526 ms
> progress: 6.0 s, 80045.4 tps, lat 0.128 ms stddev 0.079, lag 0.017 ms
> progress: 7.0 s, 79636.3 tps, lat 0.124 ms stddev 0.073, lag 0.016 ms
> progress: 8.0 s, 80535.3 tps, lat 0.125 ms stddev 0.073, lag 0.016 ms
>
> where I can see that the transactions are now actually fast enough.
> Obviously this is a toy example, but this really make -R close to
> useless to me. I often want to switch from a unthrottled to a 90% load,
> and improve the outlier beheaviour - but that outlier behaviour is
> hidden due to this redefinition of lat (as the issue is now reported
> over a much longer period of time, as it includes lag).

The outlier behavior seems very visible in both of the above. The system
completely stalled for about 1-2 seconds. And then it takes a few
seconds to process the backlog and catch up.

For testing the server under full load, like during that catch up
period, testing without -R seems better. Or perhaps you'd want to use
the --latency-limit option? You said that the transactions are now "fast
enough", so that might be a better fit for what you're trying to model.

> I think we should just restore lat to a sane behaviour under -R, and if
> you want to have lat + lag as a separate column in -R mode, then let's
> do that.

It seems like a bad idea to change the meaning of the value after the
fact. Would be good to at least rename it, to avoid confusion. Maybe
that's not too important for the interactive -P reports, but got to be
mindful about the numbers logged in the log file, at least.

If you change it, please also consider how it plays together with
--latency-limit.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2019-06-11 08:49:17 Re: [PATCH] Implement uuid_version()
Previous Message Andres Freund 2019-06-11 08:22:20 Re: Hash join explain is broken