Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Marina Polyakova <m(dot)polyakova(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Date: 2018-09-11 15:29:03
Message-ID: alpine.DEB.2.21.1809111703590.30244@lancre
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Marina,

> Hmm, but we can say the same for serialization or deadlock errors that were
> not retried (the client test code itself could not run correctly or the SQL
> sent was somehow wrong, which is also the client's fault), can't we?

I think not.

If a client asks for something "legal", but some other client in parallel
happens to make an incompatible change which result in a serialization or
deadlock error, the clients are not responsible for the raised errors, it
is just that they happen to ask for something incompatible at the same
time. So there is no user error per se, but the server is reporting its
(temporary) inability to process what was asked for. For these errors,
retrying is fine. If the client was alone, there would be no such errors,
you cannot deadlock with yourself. This is really an isolation issue
linked to parallel execution.

> Why not handle client errors that can occur (but they may also not
> occur) the same way? (For example, always abort the client, or
> conversely do not make aborts in these cases.) Here's an example of such
> error:

> client 5 got an error in command 1 (SQL) of script 0; ERROR: division by zero

This is an interesting case. For me we must stop the script because the
client is asking for something "stupid", and retrying the same won't
change the outcome, the division will still be by zero. It is the client
responsability not to ask for something stupid, the bench script is buggy,
it should not submit illegal SQL queries. This is quite different from
submitting something legal which happens to fail.

> Maybe we can limit the number of failures in one statement, and abort the
> client if this limit is exceeded?...

I think this is quite debatable, and that the best option is to leavze
this point out of the current patch, so that we could have retry on
serial/deadlock errors.

Then you can submit another patch for a feature about other errors if you
feel that there is a use case for going on in some cases. I think that the
previous behavior made sense, and that changing it should only be
considered as an option. As it involves discussing and is not obvious,
later is better.

> To get a clue about the actual issue you can use the options
> --failures-detailed (to find out out whether this is a serialization failure
> / deadlock failure / other SQL failure / meta command failure) and/or
> --print-errors (to get the complete error message).

Yep, but for me it should haved stopped immediately, as it did before.

> If you use the option --latency-limit, the time of tries will be limited
> regardless of the use of the option -t. Therefore ISTM that an unlimited
> number of tries can be used only if the time of tries is limited by the
> options -T and/or -L.

Indeed, I'm ok with forbidding unlimitted retries when under -t.

>> I'm not sure that having "--debug" implying this option
>> is useful: As there are two distinct options, the user may be allowed
>> to trigger one or the other as they wish?
>
> I'm not sure that the main debugging output will give a good clue of what's
> happened without full messages about errors, retries and failures...

I'm more argumenting about letting the user decide what they want.

> These lines are quite long - do you suggest to wrap them this way?

Sure, if it is too long, then wrap.

>> Function getTransactionStatus name does not seem to correspond fully to
>> what the function does. There is a passthru case which should be either
>> avoided or clearly commented.
>
> I don't quite understand you - do you mean that in fact this function finds
> out whether we are in a (failed) transaction block or not? Or do you mean
> that the case of PQTRANS_INTRANS is also ok?...

The former: although the function is named "getTransactionStatus", it does
not really return the "status" of the transaction (aka PQstatus()?).

> I tried not to break the limit of 80 characters, but if you think that this
> is better, I'll change it.

Hmmm. 80 columns, indeed...

>> I'd insist in a comment that "cnt" does not include "skipped" transactions
>> (anymore).
>
> If you mean CState.cnt I'm not sure if this is practically useful because the
> code uses only the sum of all client transactions including skipped and
> failed... Maybe we can rename this field to nxacts or total_cnt?

I'm fine with renaming the field if it makes thinks clearer. They are all
counters, so naming them "cnt" or "total_cnt" does not help much. Maybe
"succeeded" or "success" to show what is really counted?

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-09-11 15:29:29 Re: StandbyAcquireAccessExclusiveLock doesn't necessarily
Previous Message Tom Lane 2018-09-11 15:24:41 Re: Flexible configuration for full-text search