From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Alexander Lakhin <exclusion(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Improve error reporting in 027_stream_regress test |
Date: | 2025-07-29 12:49:49 |
Message-ID: | 869277.1753793389@sss.pgh.pa.us |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Michael Paquier <michael(at)paquier(dot)xyz> writes:
> On Tue, Jul 29, 2025 at 12:41:39AM -0400, Tom Lane wrote:
>> There is something strange happening on mamba --- not sure what,
>> but its cycle time for the past week has been a lot more than normal.
> We are getting PQPING_NO_RESPONSE meaning a lack of report activity
> from the postmaster. An increase in timeout may help, but the host
> seems like it's facing a high workload so it's not really possible to
> come with a perfect number, just an estimation. How about adding a
> --timeout to pg_isready based on PGCONNECT_TIMEOUT, like in the
> attached?
Yeah, looking closely at the failure logs [1], the standby postmaster
reported pg_isready's connection request in good order at about
08:01:47, but then nothing happened for a few seconds:
2025-07-28 08:01:47.884 EDT [6985:1] [unknown] LOG: connection received: host=[local]
2025-07-28 08:01:48.261 EDT [6985:2] [unknown] LOG: connection authenticated: user="buildfarm" method=trust (/home/buildfarm/bf-data/HEAD/pgsql.build/src/test/recovery/tmp_check/t_027_stream_regress_standby_1_data/pgdata/pg_hba.conf:117)
2025-07-28 08:01:48.261 EDT [6985:3] [unknown] LOG: connection authorized: user=buildfarm database=postgres application_name=027_stream_regress.pl
2025-07-28 08:01:51.552 EDT [6985:4] 027_stream_regress.pl LOG: could not send data to client: Broken pipe
2025-07-28 08:01:51.552 EDT [6985:5] 027_stream_regress.pl FATAL: connection to client lost
2025-07-28 08:01:51.552 EDT [6985:6] 027_stream_regress.pl LOG: disconnection: session time: 0:00:03.670 user=buildfarm database=postgres host=[local]
The client side reported failure at 08:01:50 and a bit, matching
pg_isready's default 3-second timeout:
# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24780
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24780 - accepting connections
# Running: pg_isready --host /home/buildfarm/bf-data/tmp/ep8AOH4m7l --port 24781
/home/buildfarm/bf-data/tmp/ep8AOH4m7l:24781 - no response
[08:01:50.899](1505.313s) ok 2 - regression tests pass
[08:01:50.902](0.003s) ok 3 - primary alive after regression test run
[08:01:50.905](0.003s) not ok 4 - standby alive after regression test run
[08:01:50.908](0.003s)
[08:01:50.908](0.000s) # Failed test 'standby alive after regression test run'
# at t/027_stream_regress.pl line 104.
[08:01:50.909](0.001s) # got: '0'
# expected: '1'
So I agree that this looks a lot like pg_isready timed out. It's not
too surprising if the machine was heavily loaded at that point, since
the standby would probably still be replaying the last actions of the
regression tests.
+1 for lengthening the timeout, as I see you already did.
regards, tom lane
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-07-28%2007%3A46%3A26
From | Date | Subject | |
---|---|---|---|
Next Message | Euler Taveira | 2025-07-29 12:52:37 | Re: event trigger support for PL/Python |
Previous Message | Euler Taveira | 2025-07-29 12:46:33 | Re: refactor backend type lists |