Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Date: 2023-11-21 11:00:00
Message-ID: 32d112ee-0b6f-d4ab-441b-e2bba66a1d83@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Thomas,

10.11.2023 06:31, Thomas Munro wrote:
> Here is a new attempt to fix this mess. Disclaimer: this based
> entirely on reading the manual and vicariously hacking a computer I
> don't have via CI.

As it also might (and I would like it to) be the final attempt, I decided
to gather information and all the cases that we had on this topic.
At least, for the last 5 years we've seen:

[1] 2019-01-18: Re: BUG #15598: PostgreSQL Error Code is not reported when connection terminated due to
idle-in-transaction timeout
    test 099_case_15598 made (in attachment)
no commit

[2] 2019-01-22: Rare SSL failures on eelpout
    references [1]
    test 099_rare_ssl_failures made (in attachment)
commit 2019-03-19 1f39a1c06: Restructure libpq's hqandling of send failures.

[3] 2019-11-28: pgsql: Add tests for '-f' option in dropdb utility.
    test 051_dropdb_force was proposed (included in the attachment)
commit 2019-11-30 98a9b37ba: Revert commits 290acac92b and 8a7e9e9dad.

[4] 2019-12-06: closesocket behavior in different platforms
    references [1], [2], [3]; a documentation change proposed
no commit

[5] 2020-06-03: libpq copy error handling busted
    test 099_pgbench_with_server_off made (in attachment)
commit 2020-06-07 7247e243a: Try to read data from the socket in pqSendSome's write_failed paths. (a fix for 1f39a1c06)

[6] 2020-10-19: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows
no commit

[7] 2021-11-17: Windows: Wrong error message at connection termination
   references [6]
commit: 2021-12-02 6051857fc: On Windows, close the client socket explicitly during backend shutdown.

[8] 2021-12-05 17:03:18: MSVC SSL test failure
commit: 2021-12-07 ed52c3707: On Windows, also call shutdown() while closing the client socket.

[9] 2021-12-30: Why is src/test/modules/committs/t/002_standby.pl flaky?
   additional test 099_postgres_fdw_disconnect made (in attachment)
commit 2022-01-26 75674c7ec: Revert "graceful shutdown" changes for Windows, in back branches only. (REL_14_STABLE)
commit 2022-03-22 29992a6a5: Revert "graceful shutdown" changes for Windows. (master)

[10] 2022-02-02 19:19:22: BUG #17391: While using --with-ssl=openssl and PG_TEST_EXTRA='ssl' options, SSL tests fail on
OpenBSD 7.0
commit 2022-02-12 335fa5a26: Fix thinko in PQisBusy(). (a fix for 1f39a1c06)
commit 2022-02-12 faa189c93: Move libpq's write_failed mechanism down to pqsecure_raw_write(). (a fix for 1f39a1c06)

As it becomes difficult to test/check all those cases scattered around, I
decided to retell the whole story by means of tests. Please look at the
script win-sock-tests-01.cmd attached, which can be executed both on
Windows (as regular cmd) and on Linux (bash win-sock-tests-01.cmd).

At the end of the script we can see several things.
First, the last patchset you posted, applied to b282fa88d~1, fixes the
issue discussed in this thread (it eliminates failures of
commit_ts_002_standby (and also 099_postgres_fdw_disconnect)).

Second, with the sleep added (see [6]), I had the same results of
`meson test` on Windows and on Linux.
Namely, there are some tests failing (see win-sock-tests-01.cmd) due to
walsender preventing server stop.
I describe this issue separately (see details in walsender-cannot-exit.txt;
maybe it's worth to discuss it in a separate thread) as it's kind of
off-topic. With the supplementary sleep() added to WalSndLoop(), the
complete `meson test` passes successfully both on Windows and on Linux.

Third, cases [1] and [3] are still broken, due to a Windows peculiarity.
Please see server.c and client.c attached, which demonstrate:
Case "no shutdown/closesocket" on Windows:
C:\src>server
Listening for incoming connections...
                        C:\src>client
Client connected: 127.0.0.1:64395
                        Connection to server established. Enter message: msg
Client message: msg
Sending message...
                        Sleeping...
Exiting...
C:\src>
                        Calling recv()...
                        recv() failed

Case "no shutdown/closesocket" on Linux:
$ server
Listening for incoming connections...
                        $ client
Client connected: 127.0.0.1:33044
                        Connection to server established. Enter message: msg
Client message: msg
Sending message...
                        Sleeping...
Exiting...
$
                        Calling recv()...
                        Server message: MESSAGE

Case "shutdown/closesocket" on Windows:
C:\src>server shutdown closesocket
Listening for incoming connections...
                        C:\src>client
Client connected: 127.0.0.1:64396
                        Connection to server established. Enter message: msg
Client message: msg
Sending message...
                        Sleeping...
Calling shutdown()...
Calling closesocket()...
Exiting...
C:\src>
                        Calling recv()...
                        Server message: MESSAGE

That's okay so far, but what makes cases [1]/[3] different from all cases
in the whole existing test suite, which now performed successfully, is
that psql calls send() before recv() on a socket closed and abandoned by
the server.
Those programs show on Windows:
C:\src>server shutdown closesocket
Listening for incoming connections...
                        C:\src>client send_before_recv
Client connected: 127.0.0.1:64397
                        Connection to server established. Enter message: msg
Client message: msg
Sending message...
                        Sleeping...
Calling shutdown()...
Calling closesocket()...
Exiting...
C:\src>
                        send() returned 4
                        Calling recv()...
                        recv() failed

As known, on Linux the same scenario works just fine.

Fourth, tests 099_rare_ssl_failures (and 001_ssl_tests, though more rarely)
fail for me with the latest patches (only on Windows again):
...
 8/10 postgresql:ssl_1 / ssl_1/099_rare_ssl_failures ERROR           141.34s   exit status 3
...
 9/10 postgresql:ssl_7 / ssl_7/099_rare_ssl_failures OK              142.52s   2000 subtests passed
10/10 postgresql:ssl_6 / ssl_6/099_rare_ssl_failures OK              143.00s   2000 subtests passed

Ok:                 2
Expected Fail:      0
Fail:               8

ssl_1\099_rare_ssl_failures\log\regress_log_099_rare_ssl_failures.txt:
...
iteration 354
[20:57:06.984](0.106s) ok 707 - certificate authorization fails with revoked client cert with server-side CRL directory
[20:57:06.984](0.000s) ok 708 - certificate authorization fails with revoked client cert with server-side CRL directory:
matches
iteration 355
[20:57:07.156](0.172s) ok 709 - certificate authorization fails with revoked client cert with server-side CRL directory
[20:57:07.156](0.001s) not ok 710 - certificate authorization fails with revoked client cert with server-side CRL
directory: matches
[20:57:07.159](0.003s) #   Failed test 'certificate authorization fails with revoked client cert with server-side CRL
directory: matches'
#   at .../src/test/ssl_1/t/099_rare_ssl_failures.pl line 88.
[20:57:07.159](0.000s) #                   'psql: error: connection to server at "127.0.0.1", port 59843 failed: could
not receive data from server: Software caused connection abort (0x00002745/10053)
# SSL SYSCALL error: Software caused connection abort (0x00002745/10053)'
#     doesn't match '(?^:SSL error: sslv3 alert certificate revoked)'
...

It seems to me that it can have the same explanation (if openssl can call
send() before recv() under the hood), but maybe it should be investigated
further.

> Review/poking-with-a-stick/trying-to-break-it most welcome.

I could not find anything suspicious in the code, except for maybe a typo
"The are ...".

[1] https://www.postgresql.org/message-id/flat/87k1iy44fd.fsf%40news-spur.riddles.org.uk#ba0c07f13c300d42fd537855dd95dd2b
[2] https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/E1iaD8h-0004us-K9(at)gemulon(dot)postgresql(dot)org
[4]
https://www.postgresql.org/message-id/flat/CALDaNm2tEvr_Kum7SyvFn0%3D6H3P0P-Zkhnd%3DdkkX%2BQ%3DwKutZ%3DA%40mail.gmail.com
[5] https://www.postgresql.org/message-id/flat/20200603201242.ofvm4jztpqytwfye%40alap3.anarazel.de
[6] https://www.postgresql.org/message-id/16678-253e48d34dc0c376@postgresql.org
[7] https://www.postgresql.org/message-id/flat/90b34057-4176-7bb0-0dbb-9822a5f6425b%40greiz-reinsdorf.de
[8] https://www.postgresql.org/message-id/flat/af5e0bf3-6a61-bb97-6cba-061ddf22ff6b%40dunslane.net
[9]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com
[10] https://www.postgresql.org/message-id/flat/17391-304f81bcf724b58b%40postgresql.org

Best regards,
Alexander

Attachment Content-Type Size
walsender-cannot-exit.txt text/plain 2.2 KB
server.c text/x-csrc 2.1 KB
client.c text/x-csrc 1.7 KB
win-sock-tests-01.patch text/x-patch 19.5 KB
win-sock-tests-01.cmd.txt text/plain 12.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Drouvot, Bertrand 2023-11-21 11:05:31 Re: Synchronizing slots from primary to standby
Previous Message Hayato Kuroda (Fujitsu) 2023-11-21 10:37:03 RE: Random pg_upgrade test failure on drongo