Quick Links

Re: Suggestion to add --continue-client-on-abort option to pgbench

From:	Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
To:	"Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc:	'Rintaro Ikeda' <ikedarintarof(at)oss(dot)nttdata(dot)com>, "slpmcf(at)gmail(dot)com" <slpmcf(at)gmail(dot)com>, "boekewurm+postgres(at)gmail(dot)com" <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject:	Re: Suggestion to add --continue-client-on-abort option to pgbench
Date:	2025-06-13 15:24:53
Message-ID:	20250614002453.5c72f2ec80864d840150a642@sraoss.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, 9 Jun 2025 09:34:03 +0000
"Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:

> > 2. The exit-on-abort option and continue-on-error option are mutually exclusive.
> > Therefore, I've updated the patch to throw a FATAL error when two options are
> > set simultaneously. Corresponding explanation was also added.

I don't think that's right since "abort" and "error" are different concept in pgbench.
(Here, "abort" refers to the termination of a client, not a transaction abort.)

The --exit-on-abort option forces to exit pgbench immediately when any client is aborted
due to some error. When the --continue-on-error option is not set, SQL errors other than
deadlock or serialization error cause a client to be aborted. On the other hand, when the option
is set, clients are not aborted due to any SQL errors; instead they continue to run after them.
However, clients can still be aborted for other reasons, such as connection failures or
meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option remains
useful even when --continue-on-error is enabled.

> > (I'm wondering the name of parameter should be continue-on-abort so that users
> > understand the two option are mutually exclusive.)

For the same reason as above, I believe --continue-on-error is a more accurate description
of the option's behavior.

> 02. patch separation
> How about separating the patch series like:
>
> 0001 - contains option handling and retry part, and documentation
> 0002 - contains accumulation/reporting part
> 0003 - contains tests.
>
> I hope above style is more helpful for reviewers.

I'm not sure whether it's necessary to split the patch, as the change doesn't seem very
complex. However, the current separation appears inconsistent. For example, patch 0001
modifies canRetryError(), but patch 0002 reverts that change, and so on.

>
> 04. documentation
> ```
> + Client rolls back the failed transaction and starts a new one when its
> + transaction fails due to the reason other than the deadlock and
> + serialization failure. This allows all clients specified with -c option
> + to continuously apply load to the server, even if some transactions fail.
> ```
>
> I feel the description contains bit redundant part and misses the default behavior.
> How about:
> ```
> <para>
> Clients survive when their transactions are aborted, and they continue
> their run. Without the option, clients exit when transactions they run
> are aborted.
> </para>
> <para>
> Note that serialization failures or deadlock failures do not abort the
> client, so they are not affected by this option.
> See <xref linkend="failures-and-retries"/> for more information.
> </para>
> ```

I think we can make it clearer as follows:

Allows clients to continue their run even if an SQL statement fails due to errors other
than serialization or deadlock. Without this option, the client is aborted after
such errors.

Note that serialization and deadlock failures never cause the client to be aborted,
so they are not affected by this option. See <xref linkend="failures-and-retries"/>
for more information.

That said, a review by a native English speaker would still be appreciated.

Also, we would need to update several parts of the documentation. For example, the
"Failures and Serialization/Deadlock Retries" section should be revised to describe the
behavior change. In addition, we should update the explanations of output result examples
and logging, the description of the --failures-detailed option, and so on.

If transactions are not retried after SQL errors other than serialization or deadlock,
this should also be explicitly documented.

> 05. StatsData
> ```
> + * When continue-on-error option is specified,
> + * failed (the number of failed transactions) =
> + * 'other_sql_failures' (they got a error when continue-on-error option
> + * was specified).
> ```
>
> Let me confirm one point; can serialization_failures and deadlock_failures be
> counted when continue-on-error is true? If so, the comment seems not correct for me.
> The formula can be 'serialization_failures' + 'deadlock_failures' + 'deadlock_failures'
> in the case.

> 06. StatsData
> Another point; can other_sql_failures be counted when the continue-on-error is NOT
> specified? I feel it should be...

We could do that. However, if an SQL error other than a serialization or deadlock error
occurs when --continue-on-error is not set, pgbench will be aborted midway and the printed
results will be incomplete. Therefore, this might not make much sense.

> 06. usage()
> Added line is too long. According to program_help_ok(), the output by help should
> be less than 80.

Here are additional comments from me.

@@ -4548,6 +4570,8 @@ getResultString(bool skipped, EStatus estatus)
return "serialization";
case ESTATUS_DEADLOCK_ERROR:
return "deadlock";
+ case ESTATUS_OTHER_SQL_ERROR:
+ return "error (except serialization/deadlock)";

Strings returned by getResultString() are printed in the "time" field of the
log when both the -l and --failures-detailed options are set. Therefore, they
should be single words that do not contain any space characters. I wonder if
something like "other" or "other_sql_error" would be appropriate.

@@ -4099,6 +4119,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
* can retry the error.
*/
st->state = timer_exceeded ? CSTATE_FINISHED :
+ continue_on_error ? CSTATE_FAILURE :
doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;
}
else

This fix is not necessary because doRetry() (and canRetryError(), which is called
within it) will return false when continue_on_error is set (after applying patch 0002).

case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
if (canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
goto error;
}
/* fall through */

default:
/* anything else is unexpected */
pg_log_error("client %d script %d aborted in command %d query %d: %s",
st->id, st->use_file, st->command, qrynum,
PQerrorMessage(st->con));
goto error;
}

When an SQL error other than a serialization or deadlock error occurs, an error message is
output via pg_log_error in this code path. However, I think this should be reported only
when verbose_errors is set, similar to how serialization and deadlock errors are handled when
--continue-on-error is enabled

Best regards,
Yugo Nagata

--
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>

In response to

RE: Suggestion to add --continue-client-on-abort option to pgbench at 2025-06-09 09:34:03 from Hayato Kuroda (Fujitsu)

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Jeff Davis	2025-06-13 15:51:51	Re: pg_dump --with-* options
Previous Message	Rahila Syed	2025-06-13 15:01:22	Re: add function for creating/attaching hash table in DSM registry