Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From: Marina Polyakova <m(dot)polyakova(at)postgrespro(dot)ru>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>, Kevin Grittner <kgrittn(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Date: 2017-12-05 06:44:22
Lists: pgsql-hackers

> Hello,

Hi! I'm very sorry that I did not answer for so long; I was very busy
with the release of Postgres Pro 10 :(

>> Here is the third version of the patch for pgbench thanks to Fabien
>> Coelho comments. As in the previous one, transactions with
>> serialization and deadlock failures are rolled back and retried until
>> they end successfully or their number of tries reaches maximum.
> Here is some partial review.

Thank you very much for it!

> It compiles with warnings, please fix them:
> pgbench.c:2624:28: warning: ‘failure_status’ may be used
> uninitialized in this function
> pgbench.c:2697:34: warning: ‘command’ may be used uninitialized in
> this function


> I do not think that the error handling feature needs preeminence in the
> final report, compare to scale, number of clients and so. The number
> of tries should be put further on.

I added it there only because both this field and the "transaction
type" field are transaction characteristics. I have some doubts about
where to put it. On the one hand, the number of clients, the number of
transactions per client, and the number of transactions actually
processed form a coherent logical block which I don't want to split. On
the other hand, the number of clients and the number of transactions per
client are parameters, while the number of transactions actually
processed is one of the program's results. Where, in your opinion, would
it be better to put the maximum number of transaction tries?

> I would spell "number of tries" instead of "tries number" which seems
> to
> suggest that each try is attributed a number. "sql" -> "SQL".


> For the per statement latency final report, I do not think it is worth
> distinguishing the kind of retry at this level, because ISTM that
> serialization & deadlocks are unlikely to appear simultaneously. I
> would just report total failures and total tries on this report. We
> only have 2 errors now, but if more are added I'm pretty sure that we
> would not want to have more columns...

Thanks, I agree with you.

> Moreover the 25 characters
> alignment is ugly, better use a much smaller alignment.

The variables for the numbers of failures and retries are of type int64
because the variable for the total number of transactions has the same
type. That is why the alignment is so large (as I understand it now, 20
characters would be enough). Would you prefer a floating alignment,
depending on the maximum number of failures/retries for any command in
any script?

> I'm okay with having details shown in the "log to file" group report.

I think that the output format of the retries statistics should be the
same everywhere, so I would just like to output the total number of
retries.

> The documentation does not seem consistent. It discusses "the very last
> fields"
> and seem to suggest that there are two, but the example trace below
> just
> adds one field.

I'm sorry, I do not understand what you are talking about. I used the
commands and files from the end of your message ("psql <
dl_init.sql" and "pgbench -f dl_trans.sql -c 8 -T 10 -P 1"), and I got
this output from pgbench:

starting vacuum...ERROR: relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR: relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR: relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
progress: 1.0 s, 14.0 tps, lat 9.094 ms stddev 5.304
progress: 2.0 s, 25.0 tps, lat 284.934 ms stddev 450.692, 1 failed
progress: 3.0 s, 21.0 tps, lat 337.942 ms stddev 473.210, 1 failed
progress: 4.0 s, 11.0 tps, lat 459.041 ms stddev 499.908, 2 failed
progress: 5.0 s, 28.0 tps, lat 220.219 ms stddev 411.390, 2 failed
progress: 6.0 s, 5.0 tps, lat 402.695 ms stddev 492.526, 2 failed
progress: 7.0 s, 24.0 tps, lat 343.249 ms stddev 626.181, 2 failed
progress: 8.0 s, 14.0 tps, lat 505.396 ms stddev 501.836, 1 failed
progress: 9.0 s, 40.0 tps, lat 180.080 ms stddev 381.335, 1 failed
progress: 10.0 s, 1.0 tps, lat 0.000 ms stddev 0.000, 1 failed
transaction type: dl_trans.sql
transaction maximum tries number: 1
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 1
duration: 10 s
number of transactions actually processed: 191
number of failures: 14 (7.330 %)
latency average = 356.701 ms
latency stddev = 564.942 ms
tps = 18.735807 (including connections establishing)
tps = 18.744898 (excluding connections establishing)

As I understand it, in the documentation "the very last fields" refer to
the aggregation logging which is not used here. So what's the problem?

> If you want a paragraph you should add <para>, skipping a line does not
> work (around "All values are computed for ...").

Sorry, thanks =[

> I do not understand the second note of the --max-tries documentation.
> It seems to suggest that some script may not end their own
> transaction...
> which should be an error in my opinion? Some explanations would be
> welcome.

As you told me here [1], "I disagree about exit in ParseScript if the
transaction block is not completed <...> and would break an existing
feature.". Maybe it would be better to say this:

In pgbench you can use scripts in which transaction blocks do not end.
Be careful in this case because transactions that span more than one
script are not rolled back and will not be retried in case of an error.
In such cases, the script in which the error occurred is reported as
failed.


> I'm not sure that "Retries" deserves a type of its own for two
> counters.


> The "retries" in RetriesState may be redundant with these.

The "retries" in RetriesState have a different goal: they count not all
the retries during the execution of the current script, but only the
retries for the current transaction.

> The failures are counted on simple counters while retries have a type,
> this is not consistent. I suggest to just use simple counters
> everywhere.


> I'm ok with having the detail report tell about failures & retries only
> when some occured.


> typo: sucessufully -> successfully

Thanks! =[

> If a native English speaker could provide an opinion on that, and more
> generally review the whole documentation, it would be great.

I agree with you))

> I think that the rand functions should really take a random_state
> pointer
> argument, not a Thread or Client.

Thanks, I agree.

> I'm at odds that FailureStatus does not have a clean NO_FAILURE state,
> and that it is merged with misc failures.

It is funny, but for the code it really did not matter :)

> I'm not sure that initRetries, mergeRetries, getAllRetries really
> deserve a function.


> I do not thing that there should be two accum Functions. Just extend
> the existing one, and adding zero to zero is not a problem.


> I guess that in the end pgbench & psql variables will have to be merged
> if pgbench expression engine is to be used by psql as well, but this is
> not linked to this patch.


> The tap tests seems over-complicated and heavy with two pgbench run in
> parallel... I'm not sure we really want all that complexity for this
> somehow small feature. Moreover pgbench can run several scripts, I'm
> not
> sure why two pgbench would need to be invoked. Could something much
> simpler and lighter be proposed instead to test the feature?

Firstly, two pgbench instances need to be invoked because we don't know
which of them will get a deadlock failure. Secondly, I tried much
simpler tests, but all of them failed intermittently even though
everything was OK:
- tests in which pgbench runs 5 clients and 10 transactions per client
to get a serialization/deadlock failure on any client (sometimes there
are no failures when they are expected)
- tests in which pgbench runs 30 clients and 400 transactions per client
to get a serialization/deadlock failure on any client (sometimes there
are no failures when they are expected)
- tests in which a psql session starts concurrently and sleep commands
are used to make it wait for pgbench for 10 seconds (sometimes this does
not work)
Only advisory locks helped me avoid such errors in the tests :(

> The added code does not conform to Pg C style. For instance, if brace
> should be aligned to the if. Please conform the project style.

I'm sorry, thanks =[

> The is_transaction_block_end seems simplistic. ISTM that it would not
> work with compound commands. It should be clearly documented somewhere.

Thanks, I'll fix it.

> Also find attached two scripts I used for some testing:
> psql < dl_init.sql
> pgbench -f dl_trans.sql -c 8 -T 10 -P 1


Marina Polyakova
Postgres Professional:
The Russian Postgres Company
