RE: Improving tracking/processing of buildfarm test failures

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: 'Alexander Lakhin' <exclusion(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: RE: Improving tracking/processing of buildfarm test failures
Date: 2025-02-06 08:55:39
Message-ID: OSCPR01MB1496600194BA0F22415C242DCF5F62@OSCPR01MB14966.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear hackers,

I hope I'm on the correct thread. If not, please let me know.
I found a buildfarm (BF) failure that may be a long-standing test issue. If so,
could the Wiki page be updated?

The failure is at [1]. The run contains two errors; this report is about the one in
test_decoding/isolation/slot_creation_error.

Failed test
=======
The failed test uses two sessions, which work as follows:

1. Open a session, s1, and start a transaction.
2. Open another session, s2, and try to create a slot. It waits...
3. On s1, execute a pg_terminate_backend($s2).
4. On s1, confirm the slot creation fails.

The actual difference is shown in [2]. 's2_init' is the step that creates the slot.
We can see that the isolation tester detected the termination of the process and exited.

Analysis
=====
I think this is a timing issue.

Usually, when a backend receives SIGTERM, it cancels the current transaction and
exits. I think the test assumes that the isolation tester, in try_complete_step(),
detects the end of the slot-creating transaction, and that it can do so before the
backend process terminates.

However, if the isolation tester misses the end of the transaction, it detects the
termination of the backend process instead. In this case, PQconsumeInput() returns 0
(failure), so the tester process itself exits. This is implemented in
try_complete_step(), as shown in [3].
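
To make the suspected ordering concrete, here is a small toy model of the race.
It is only a sketch of my hypothesis, not code from isolationtester.c or libpq:
BackendTimeline, PollOutcome, model_poll() and the tick values are all made-up
names for illustration, and real socket behavior is more complicated than this.

```c
#include <assert.h>

/*
 * Toy model of the suspected race.  All names here are hypothetical; nothing
 * in this block exists in isolationtester.c or libpq.
 */
typedef enum
{
	STILL_WAITING,				/* no poll has seen anything yet */
	STEP_COMPLETED_WITH_ERROR,	/* expected: "step s2_init: <... completed>" */
	TESTER_EXITED				/* observed: "PQconsumeInput failed: ..." */
} PollOutcome;

typedef struct
{
	int			error_sent_at;		/* tick when s2 reports the cancelled command */
	int			backend_exit_at;	/* tick when the s2 backend process is gone */
} BackendTimeline;

/*
 * Walk through the tester's polls in ascending tick order (assumption: the
 * caller passes them sorted).  A poll inside [error_sent_at, backend_exit_at)
 * sees the step finish; a first poll at or after backend_exit_at is modeled
 * as the PQconsumeInput() failure that makes the tester exit.
 */
PollOutcome
model_poll(const BackendTimeline *b, const int *poll_ticks, int npolls)
{
	assert(b->error_sent_at <= b->backend_exit_at);

	for (int i = 0; i < npolls; i++)
	{
		int			t = poll_ticks[i];

		if (t < b->error_sent_at)
			continue;			/* nothing to read yet, keep waiting */
		if (t < b->backend_exit_at)
			return STEP_COMPLETED_WITH_ERROR;
		return TESTER_EXITED;	/* real code: fprintf(stderr, ...); exit(1); */
	}
	return STILL_WAITING;
}
```

With a timeline of {10, 12}, polls at ticks 5, 11, 15 land inside the window and
give the expected output, while polls at 5, 9, 13 skip over it and give the
failure seen on the BF animal.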

What do you think?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-02-06%2001%3A19%3A14
[2]:
```
@@ -92,23 +92,7 @@
FROM pg_stat_activity
WHERE application_name = 'isolation/slot_creation_error/s2';
<waiting ...>
-step s2_init: <... completed>
-FATAL: terminating connection due to administrator command
-server closed the connection unexpectedly
+PQconsumeInput failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
[3]:
```
else if (!PQconsumeInput(conn)) /* select(): data available */
{
	fprintf(stderr, "PQconsumeInput failed: %s\n",
			PQerrorMessage(conn));
	exit(1);
}
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED
