Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From: Marina Polyakova <m(dot)polyakova(at)postgrespro(dot)ru>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Date: 2018-09-11 13:47:57
Message-ID: f167595f992d31710048f017d48626da@postgrespro.ru
Lists: pgsql-hackers

On 08-09-2018 16:03, Fabien COELHO wrote:
> Hello Marina,
>
>> v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch
>> - the main patch for handling client errors and repetition of
>> transactions with serialization/deadlock failures (see the detailed
>> description in the file).
>
> About patch v11-3.
>
> Patch applies cleanly on top of the other two. Compiles, global and
> local "make check" are ok.

:-)

> * Features
>
> As far as the actual retry feature is concerned, I'd say we are nearly
> there. However I have an issue with changing the behavior on meta
> commands and other SQL errors, which I find undesirable.
>
> When a meta-command fails, before the patch the command is aborted and
> there is a convenient error message:
>
> sh> pgbench -T 10 -f bad-meta.sql
> bad-meta.sql:1: unexpected function name (false) in command "set"
> [...]
> \set i false + 1 [...]
>
> After the patch it is simply counted, pgbench loops on the same error
> till the time is completed, and there is no clue about the actual
> issue:
>
> sh> pgbench -T 10 -f bad-meta.sql
> starting vacuum...end.
> transaction type: bad-meta.sql
> duration: 10 s
> number of transactions actually processed: 0
> number of failures: 27993953 (100.000%)
> ...
>
> Same thing about SQL errors, an immediate abort...
>
> sh> pgbench -T 10 -f bad-sql.sql
> starting vacuum...end.
> client 0 aborted in command 0 of script 0; ERROR: syntax error at or near ";"
> LINE 1: SELECT 1 + ;
>
> ... is turned into counting without aborting nor error messages, so
> that there is no clue that the user was asking for something bad.
>
> sh> pgbench -T 10 -f bad-sql.sql
> starting vacuum...end.
> transaction type: bad-sql.sql
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 10 s
> number of transactions actually processed: 0
> number of failures: 274617 (100.000%)
> # no clue that there was a syntax error in the script
>
> I do not think that these changes of behavior are desirable. Meta
> command and miscellaneous SQL errors should result in immediately
> aborting the whole run, because the client test code itself could not
> run correctly or the SQL sent was somehow wrong, which is also the
> client's fault, and the server performance bench does not make much
> sense in such conditions.
>
> ISTM that the focus of this patch should only be to handle some server
> runtime errors that can be retried, but not to change pgbench behavior
> on other kinds of errors. If these are to be changed, ISTM that it
> would be a distinct patch and would require some discussion, and
> possibly an option to enable it or not if some use case emerges. As far
> as this patch is concerned, I'd suggest leaving that out.
...
> The following remarks are linked to the change of behavior discussed
> above: makeVariableValue error message is not for debug, but must be
> kept in all cases, and the false returned must result in an immediate
> abort. Same thing about lookupCreateVariable, an invalid name is a user
> error which warrants an immediate abort. Same thing again about coerce*
> functions or evalStandardFunc... Basically, most/all added
> "debug_level >= DEBUG_ERRORS" are not desirable.

Hmm, but we can say the same for serialization or deadlock errors that
were not retried (the client test code itself could not run correctly or
the SQL sent was somehow wrong, which is also the client's fault), can't
we? Why not handle client errors that may or may not occur in the same
way? (For example, always abort the client, or conversely never abort in
these cases.) Here's an example of such an error:

starting vacuum...end.
transaction type: pgbench_rare_sql_error.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 250
number of transactions actually processed: 2500/2500
maximum number of tries: 1
latency average = 0.375 ms
tps = 26695.292848 (including connections establishing)
tps = 27489.678525 (excluding connections establishing)
statement latencies in milliseconds and failures:
0.001 0 \set divider random(-1000, 1000)
0.245 0 SELECT 1 / :divider;

starting vacuum...end.
client 5 got an error in command 1 (SQL) of script 0; ERROR: division by zero
client 0 got an error in command 1 (SQL) of script 0; ERROR: division by zero
client 7 got an error in command 1 (SQL) of script 0; ERROR: division by zero

transaction type: pgbench_rare_sql_error.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 250
number of transactions actually processed: 2497/2500
number of failures: 3 (0.120%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other SQL failures: 3 (0.120%)
maximum number of tries: 1
latency average = 0.579 ms (including failures)
tps = 17240.662547 (including connections establishing)
tps = 17862.090137 (excluding connections establishing)
statement latencies in milliseconds and failures:
0.001 0 \set divider random(-1000, 1000)
0.338 3 SELECT 1 / :divider;

Maybe we can limit the number of failures in one statement, and abort
the client if this limit is exceeded?...

To get a clue about the actual issue you can use the options
--failures-detailed (to find out whether this is a serialization
failure / deadlock failure / other SQL failure / meta command failure)
and/or --print-errors (to get the complete error message).

> Doc says "you cannot use an infinite number of retries without
> latency-limit..."
>
> Why should this be forbidden? At least if -T timeout takes precedence
> and shortens the execution, ISTM that there could be good reason to
> test that. Maybe it could be blocked only under -t if this would lead
> to a non-ending run.
...
> * Comments
>
> "There're different types..." -> "There are different types..."
>
> "after the errors and"... -> "after errors and"...
>
> "the default value of max_tries is set to 1" -> "the default value
> of max_tries is 1"
>
> "We cannot retry the transaction" -> "We cannot retry a transaction"
>
> "may ultimately succeed or get a failure," -> "may ultimately succeed
> or fail,"
...
> * Documentation:
>
> Some suggestions which may be improvements, although I'm not a native
> English speaker.
>
> ISTM that there are too many "the":
> - "turns on the option ..." -> "turns on option ..."
> - "When the option ..." -> "When option ..."
> - "By default the option ..." -> "By default option ..."
> - "only if the option ..." -> "only if option ..."
> - "combined with the option ..." -> "combined with option ..."
> - "without the option ..." -> "without option ..."
> - "is the sum of all the retries" -> "is the sum of all retries"
>
> "infinite" -> "unlimited"
>
> "not retried at all" -> "not retried" (maybe several times).
>
> "messages of all errors" -> "messages about all errors".
>
> "It is assumed that the scripts used do not contain" ->
> "It is assumed that pgbench scripts do not contain"

Thank you, I'll fix this.

If you use the option --latency-limit, the time of the tries is limited
regardless of whether the option -t is used. Therefore ISTM that an
unlimited number of tries can be used only if the time of the tries is
limited by the options -T and/or -L.

> As "--print-errors" is really for debug, maybe it could be named
> "--debug-errors".

Ok!

> I'm not sure that having "--debug" implying this option
> is useful: As there are two distinct options, the user may be allowed
> to trigger one or the other as they wish?

I'm not sure that the main debugging output will give a good clue about
what happened without the full messages about errors, retries and
failures...

> * Code
>
> <...>
>
> sendRollback(): I'd suggest to simplify. The prepare/extended statement
> stuff is really about the transaction script, not about dealing with
> errors, especially as there is no significant advantage in preparing a
> "ROLLBACK" statement which is short and has no parameters. I'd suggest
> removing this function and just issuing PQsendQuery("ROLLBACK;") in
> all cases.

Ok!

> In copyVariables, I'd simplify
>
> + if (source_var->svalue == NULL)
> + dest_var->svalue = NULL;
> + else
> + dest_var->svalue = pg_strdup(source_var->svalue);
>
> as:
>
> dest_var->svalue = (source_var->svalue == NULL) ? NULL :
> pg_strdup(source_var->svalue);

> About:
>
> + if (doRetry(st, &now))
> + st->state = CSTATE_RETRY;
> + else
> + st->state = CSTATE_FAILURE;
>
> -> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;

These lines are quite long - do you suggest wrapping them this way?

+ dest_var->svalue = ((source_var->svalue == NULL) ? NULL :
+ pg_strdup(source_var->svalue));

+ st->state = (doRetry(st, &now) ? CSTATE_RETRY :
+ CSTATE_FAILURE);

> + if (sqlState) -> if (sqlState != NULL) ?

Ok!

> Function getTransactionStatus name does not seem to correspond fully
> to what the function does. There is a passthru case which should be
> either avoided or clearly commented.

I don't quite understand you - do you mean that in fact this function
finds out whether we are in a (failed) transaction block or not? Or do
you mean that the case of PQTRANS_INTRANS is also ok?...

> About:
>
> - commandFailed(st, "SQL", "perhaps the backend died while
> processing");
> + clientAborted(st,
> + "perhaps the backend died while processing");
>
> keep on one line?

I tried not to break the limit of 80 characters, but if you think that
this is better, I'll change it.

> Overall, the comment text in StatsData is very clear. However, the
> comments are not clearly linked to the struct fields. I'd suggest that
> each field, when used, should be quoted, so as to separate English from
> code, and the struct name should always be used explicitly when
> possible.

Ok!

> I'd insist in a comment that "cnt" does not include "skipped"
> transactions
> (anymore).

If you mean CState.cnt, I'm not sure if this is practically useful
because the code uses only the sum of all client transactions, including
skipped and failed ones... Maybe we can rename this field to nxacts or
total_cnt?

> About v11-4. I do not feel that these changes are very useful/important
> for now. I'd propose that you prioritize updating 11-3 so that we can
> have another round about it as soon as possible, and keep that one for
> later.

Ok!

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
