Re: Suggestion to add --continue-client-on-abort option to pgbench

From: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>
To: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rintaro Ikeda <ikedarintarof(at)oss(dot)nttdata(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "slpmcf(at)gmail(dot)com" <slpmcf(at)gmail(dot)com>, "boekewurm+postgres(at)gmail(dot)com" <boekewurm+postgres(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Srinath Reddy Sadipiralla <srinath2133(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: Suggestion to add --continue-client-on-abort option to pgbench
Date: 2025-11-13 08:13:30
Message-ID: F8633D97-1E57-4C3B-9075-B08B5A297ACB@gmail.com
Lists: pgsql-hackers

> On Nov 13, 2025, at 15:09, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
>
> On Thu, 13 Nov 2025 11:55:25 +0900
> Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
>> On Wed, Nov 12, 2025 at 6:34 PM Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp> wrote:
>>> I've attached an updated patch.
>>
>> Thanks for updating the patch!
>>
>>> The comment for the PQpipelineSync() call has been also updated to clarify
>>> why it is necessary.
>>
>> + /*
>> + * If a PGRES_PIPELINE_SYNC is followed by something other than
>> + * PGRES_PIPELINE_SYNC or NULL, another PGRES_PIPELINE_SYNC will
>> + * eventually follow.
>> + */
>>
>> LGTM. I'd like to append "Reset received_sync to false to wait for
>> it." into this comment.
>>
>>
>>> In addition, I added a connection status check in the loop to avoid an
>>> infinte loop that waiting for PQpipelineSync after a connection failure.
>>
>> Would it be better to move this status check right after PQgetResult()
>> so that connection failures can be detected regardless of what result
>> it returns?
>>
>> + pg_log_error("client %d aborted: the backend died while rolling back
>> the failed transaction after",
>>
>> The trailing “after” seems unnecessary.
>>
>> Since there's no guarantee the backend actually died in this case,
>> it might be better to use something like "client %d aborted while rolling back
>> the transaction after an error; perhaps the backend died while processing"
>> which matches the wording used under CSTATE_WAIT_ROLLBACK_RESULT
>> in advanceConnectionState().
>
> Thank you for your review!
> I've attached an updated patch reflecting your suggestion.
>
> Regards,
> Yugo Nagata
>
> --
> Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
> <v3-0001-pgbench-Fix-assertion-failure-with-multiple-syncp.patch>

With the v3 patch, the assertion failure is gone, but the test results are no longer accurate, because discardAvailableResults() discards the PIPELINE_SYNC messages. This is my test result with v3:
```
% pgbench -n --failures-detailed -M extended -j 2 -c 2 -f deadlock.sql -f deadlock2.sql evantest
pgbench (19devel)
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 6, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=8, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 6, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=8, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 2, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 3, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=4, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 6, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=8, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
EVAN: discardAvailableResults: discarding result: res=11, conn=0
EVAN: discardAvailableResults: discarding result: res=7, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 2, conn=0
EVAN: discardAvailableResults: discarding result: res=10, conn=0
EVAN: discardAvailableResults: discarding result: res=2, conn=0
EVAN: discardAvailableResults: discarding result value: 3, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got result: res=2, conn=0
EVAN: discardUntilSync: discarding result value=4, conn=0
EVAN: discardUntilSync: Got result: res=7, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got SYNC, conn=0
EVAN: discardUntilSync: Got NULL, conn=0
transaction type: multiple scripts
scaling factor: 1
query mode: extended
number of clients: 2
number of threads: 2
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 15/20
number of failed transactions: 5 (25.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 5 (25.000%)
number of other failures: 0 (0.000%)
latency average = 502.741 ms (including failures)
initial connection time = 2.882 ms
tps = 2.983644 (without initial connection time)
SQL script 1: deadlock.sql
- weight: 1 (targets 50.0% of total)
- 11 transactions (55.0% of total)
- number of transactions actually processed: 9 (tps = 1.790186)
- number of failed transactions: 2 (18.182%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 2 (18.182%)
- number of other failures: 0 (0.000%)
- latency average = 336.030 ms
- latency stddev = 472.160 ms
SQL script 2: deadlock2.sql
- weight: 1 (targets 50.0% of total)
- 9 transactions (45.0% of total)
- number of transactions actually processed: 6 (tps = 1.193457)
- number of failed transactions: 3 (33.333%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 3 (33.333%)
- number of other failures: 0 (0.000%)
- latency average = 335.757 ms
- latency stddev = 472.107 ms
```

We can see:
* number of transactions actually processed: 15/20
* number of failed transactions: 5 (25.000%)

However, with the dirty diff I sent this morning:
```
% pgbench -n --failures-detailed -M extended -j 2 -c 2 -f deadlock.sql -f deadlock2.sql evantest
… omit debug logs …
transaction type: multiple scripts
scaling factor: 1
query mode: extended
number of clients: 2
number of threads: 2
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 20/20
number of failed transactions: 0 (0.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other failures: 0 (0.000%)
latency average = 302.863 ms
initial connection time = 2.749 ms
tps = 6.603655 (without initial connection time)
SQL script 1: deadlock.sql
- weight: 1 (targets 50.0% of total)
- 11 transactions (55.0% of total)
- number of transactions actually processed: 11 (tps = 3.632010)
- number of failed transactions: 0 (0.000%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 0 (0.000%)
- number of other failures: 0 (0.000%)
- latency average = 275.532 ms
- latency stddev = 445.629 ms
SQL script 2: deadlock2.sql
- weight: 1 (targets 50.0% of total)
- 9 transactions (45.0% of total)
- number of transactions actually processed: 9 (tps = 2.971645)
- number of failed transactions: 0 (0.000%)
- number of serialization failures: 0 (0.000%)
- number of deadlock failures: 0 (0.000%)
- number of other failures: 0 (0.000%)
- latency average = 336.150 ms
- latency stddev = 472.091 ms
```

Now all transactions are processed and there are no failures. I think that is the expected behavior: \syncpipeline should allow the failed part of the pipeline to be rolled back, so all scripts should succeed.
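For context, the general shape of the scripts involved (a hypothetical sketch, not my actual deadlock.sql) uses pgbench's pipeline meta-commands, where each \syncpipeline establishes a synchronization point so an error only aborts the pipeline up to the next sync:

```
\startpipeline
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 1;
\syncpipeline
UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = 2;
\syncpipeline
\endpipeline
```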

It feels to me that, because the new discardAvailableResults() was introduced on master, we need different fixes for master and the back branches.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
