Re: standby recovery fails (tablespace related) (tentative patch and discussion)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Date: 2022-07-30 23:08:01
Message-ID: CA+hUKGKU3p1=eSE7gQHaH=GxPLsVYCRnDdUaHcu_aV2r4GwdPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jul 31, 2022 at 4:51 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> BTW, quite aside from stability, is it really necessary for this test to
> be so freakin' slow? florican for instance reports
>
> [12:43:38] t/025_stuck_on_old_timeline.pl ....... ok 49010 ms ( 0.00 usr 0.00 sys + 3.64 cusr 2.49 csys = 6.13 CPU)
> [12:44:12] t/026_overwrite_contrecord.pl ........ ok 34751 ms ( 0.01 usr 0.00 sys + 3.14 cusr 1.76 csys = 4.91 CPU)
> [12:49:00] t/027_stream_regress.pl .............. ok 287278 ms ( 0.00 usr 0.00 sys + 9.66 cusr 6.95 csys = 16.60 CPU)
> [12:50:04] t/028_pitr_timelines.pl .............. ok 64543 ms ( 0.00 usr 0.00 sys + 3.59 cusr 3.20 csys = 6.78 CPU)
> [12:50:17] t/029_stats_restart.pl ............... ok 12505 ms ( 0.02 usr 0.00 sys + 3.16 cusr 1.40 csys = 4.57 CPU)
> [12:50:51] t/030_stats_cleanup_replica.pl ....... ok 33933 ms ( 0.01 usr 0.01 sys + 3.55 cusr 2.46 csys = 6.03 CPU)
> [12:51:25] t/031_recovery_conflict.pl ........... ok 34249 ms ( 0.00 usr 0.00 sys + 3.37 cusr 2.20 csys = 5.57 CPU)
> [12:52:09] t/032_relfilenode_reuse.pl ........... ok 44274 ms ( 0.01 usr 0.00 sys + 3.21 cusr 2.05 csys = 5.27 CPU)
> [12:54:07] t/033_replay_tsp_drops.pl ............ ok 117840 ms ( 0.01 usr 0.00 sys + 8.72 cusr 5.41 csys = 14.14 CPU)
>
> 027 is so bloated because it runs the core regression tests YA time,
> which I'm not very happy about either; but that's no excuse for
> every new test to contribute an additional couple of minutes.

Complaints about 027 noted, I'm thinking about what we could do about that.

As for 033, I worried that it might be the new ProcSignalBarrier stuff
around tablespaces, but thankfully the DEBUG logging I added there
recently shows those all completing in single digit milliseconds. I
also confirmed there are no unexpected fsync'd being produced here.

That is quite a lot of CPU, but it's a huge amount of total runtime.
It runs in 5-8 seconds on various modern systems, 19 seconds on my
Linux RPi4, and 50 seconds on my Celeron-powered NAS box with spinning
disks.

I noticed this is a 32 bit FBSD system. Is it running on UFS, perhaps
on slow storage? Are soft updates enabled (visible as options in
output of "mount")? Without soft updates, a lot more file system ops
perform synchronous I/O, which really slows down our tests. In
general, UFS isn't as good as modern file systems at avoiding I/O for
short-lived files, and we set up and tear down a lot of them in our
testing. Another thing that makes a difference is to use a filesystem
with 8KB block size. This has been a subject of investigation for
speeding up CI (see src/tools/ci/gcp_freebsd_repartition.sh), but
several mysteries remain unsolved...

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-07-30 23:09:51 Regression coverage for contrib/tcn
Previous Message Tom Lane 2022-07-30 22:08:48 Reducing the maintenance overhead of test_oat_hooks