Re: pgbench logging broken by time logic changes

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Gregory Smith <gregsmithpgsql(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, david(dot)christensen(at)crunchydata(dot)com
Subject: Re: pgbench logging broken by time logic changes
Date: 2021-06-23 06:26:32
Message-ID: alpine.DEB.2.22.394.2106230802100.1117163@pseudo
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello,

> +# note: this test is time sensitive, and may fail on a very
> +# loaded host.
> +# note: --progress-timestamp is not tested
> +my $delay = pgbench(
> + '-T 2 -P 1 -l --aggregate-interval=1 -S -b se(at)2'
> + . ' --rate=20 --latency-limit=1000 -j ' . $nthreads
> + . ' -c 3 -r',
> + 0,
> + [ qr{type: multiple},
> + qr{clients: 3},
> + qr{threads: $nthreads},
> + qr{duration: 2 s},
> + qr{script 1: .* select only},
> + qr{script 2: .* select only},
> + qr{statement latencies in milliseconds},
> + qr{FROM pgbench_accounts} ],
> + [ qr{vacuum}, qr{progress: 1\b} ],
> + 'pgbench progress', undef,
> + "--log-prefix=$bdir/001_pgbench_log_1");

> Could we make that shorter at 1s? That will shorten the duration of
> the test run. It is easy to miss that this test is for
> --aggregate-interval (aka the logAgg() path) without a comment.

It is for -T, -P and --aggregate-interval. The units of all these is
seconds, the minimum is 1, I put 2 so that It is pretty sure to get some
output. We could try 1, but I'm less confident that an output is ensured
in all cases on a very slow host which may decide to shorten the run
before having shown a progress for instance.

> +# cool check that we are around 2 seconds
> +# The rate may results in an unlucky schedule which triggers
> +# an early exit, hence the loose bound.
> +ok(0.0 < $delay && $delay < 4.0, "-T 2 run around 2 seconds");
>
> The command itself would not fail, but we would just fail on a machine
> where the difference in start/stop time is higher than 4 seconds,
> right?

Yep. It could detect a struck pgbench process which was one of the
reported issue. Maybe there should be a timeout added.

> On RPI-level machines, this could fail reliably.

Dunno, Not sure what RPI means. Probably not "Retail Price Index"… maybe
Rasberry-Pi? In that case, the O-4 second leeway is intended to be loose
enough to accomodate such hosts, but I cannot test that.

> I am not completely sure what's the additional value we can get from
> that extra test, to be honest.

This would be to detect a somehow stuck process. It could be looser if
necessary. Or removed, or preferably commented out, or enabled with some
options (eg an environment variable? configure option?). Such control
could also skip all 3 calls, in which case the 2 seconds is not an issue.

> +# $nthreads threads, 2 seconds, but due to timing imprecision we might get
> +# only 1 or as many as 3 progress reports per thread.
> +check_pgbench_logs($bdir, '001_pgbench_log_1', $nthreads, 1, 3,
> + qr{^\d{10,} \d{1,2} \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+ \d+$});
> +
> Now this one is good and actually stable thanks to the fact that we'd
> get at least the final logs, and the main complain we got to discuss
> about on this thread was the format of the logs.

Yep, this test would have probably detected the epoch issue reported by
Greg.

> I would say to give up on the first test, and keep the second.

You mean remove the time check and keep the log check. I'd like to keep a
time check, or at least have it in comment so that I can enable it simply.

> But, is this regex correct? Shouldn't we check for six integer fields
> only with the first one having a minimal number of digits for the epoch?

Epoch (seconds since 1970-01-01?) is currently 10 digits. Not sure how
well it would work if some host have another zero start date.

Given the options of the bench run, there are that many fields in the
log output, I'm not sure why we would want to check for less?

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nicolas CHAHWEKILIAN 2021-06-23 06:30:09 Re: [PATCH] Hooks at XactCommand level
Previous Message Amit Kapila 2021-06-23 06:14:13 Re: [bug?] Missed parallel safety checks, and wrong parallel safety