Re: Recent 027_streaming_regress.pl hangs

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Recent 027_streaming_regress.pl hangs
Date: 2024-03-19 11:00:00
Message-ID: ed767fdd-bc7a-2a0b-d94a-b4a724af50bf@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

14.03.2024 23:56, Tom Lane wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
>> On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>>> Could it be that the timeout (360 sec?) is just not enough for the test
>>> under the current (changed due to switch to meson) conditions?
>> But you're right that under meson the test takes a lot longer, I guess
>> due to increased concurrency:
> What it seems to be is highly variable. Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds. I wonder what
> else is happening on that machine. Also, this is probably not
> helping anything:
>
> 'extra_config' => {
> ...
> 'fsync = on'
>
> I would suggest turning that off and raising wait_timeout a good
> deal, and then we'll see if calliphoridae gets any more stable.

I could reproduce similar failures with
PG_TEST_EXTRA=wal_consistency_checking
only, running 5 tests in parallel on a slowed-down VM, so that test
duration increased to ~1900 seconds, but perhaps that buildfarm machine
has a different bottleneck (I/O?) or it's concurrent workload is not
uniform as in my experiments.

Meanwhile, I've analyzed failed test logs from buildfarm and calculated
the percentage of WAL replayed before timeout.
For instance, one of the failures:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2022%3A36%3A40
standby_1.log:
2024-03-18 22:38:22.743 UTC [2010896][walreceiver][:0] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
...
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] LOG: recovery restart point at 0/E00E030
2024-03-18 22:50:02.439 UTC [2004203][checkpointer][:0] DETAIL: Last completed transaction was at log time 2024-03-18
22:41:26.647756+00.
2024-03-18 22:50:12.841 UTC [2010896][walreceiver][:0] FATAL:  could not receive data from WAL stream: server closed the
connection unexpectedly

primary.log:
2024-03-18 22:38:23.754 UTC [2012240][client backend][3/3:0] LOG: statement: GRANT ALL ON SCHEMA public TO public;
# ^^^ One of the first records produced by `make check`
...
2024-03-18 22:41:26.647 UTC [2174047][client backend][10/752:0] LOG:  statement: ALTER VIEW my_property_secure SET
(security_barrier=false);
# ^^^ A record near the last completed transaction on standby
...
2024-03-18 22:44:13.226 UTC [2305844][client backend][22/3784:0] LOG:  statement: DROP TABLESPACE regress_tblspace_renamed;
# ^^^ One of the last records produced by `make check`

\set t0 '22:38:23.754' \set t1 '22:44:13.226' \set tf '22:41:26.647756'
select extract(epoch from (:'tf'::time - :'t0'::time)) / extract(epoch from (:'t1'::time - :'t0'::time));
~52%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2018%3A58%3A58
~48%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2016%3A41%3A13
~43%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-18%2015%3A47%3A09
~36%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-15%2011%3A24%3A38
~87%

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-03-17%2021%3A55%3A41
~36%

So it still looks like a performance-related issue to me. And yes,
fsync = off -> on greatly increases (~3x) the overall test duration in
that my environment.

Best regards,
Alexander

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-03-19 11:02:45 Re: SQL:2011 application time
Previous Message Amit Kapila 2024-03-19 10:50:35 Re: Introduce XID age and inactive timeout based replication slot invalidation