| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | Noah Misch <noah(at)leadboat(dot)com> | 
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Pang <robertpang(at)google(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: Back-patch of: avoid multiple hard links to same WAL file after a crash | 
| Date: | 2025-04-25 19:35:06 | 
| Message-ID: | f7ekxpwertlg2k4ux6dexi23k6n63fq5f7w5v3k5r556sw7dh7@ukyye6rmw6uv | 
| Lists: | pgsql-hackers | 
Hi,
On 2025-04-20 14:53:39 -0700, Noah Misch wrote:
> On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> > On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > > Noah Misch <noah(at)leadboat(dot)com> writes:
> > > > Tom and Michael, do you still object to the test addition, or not?  If there
> > > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.
> 
> Pushed as commit 714bd9e.  The failure so far is
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
> with these highlights:
> 
> pg_ctl: server does not shut down
> 
> 2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG:  received immediate shutdown request
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL:  archive command was terminated by signal 3: Quit
> 2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL:  The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"
> 
> The checkpoints and WAL creation took 30s, but archiving was only 20% done
> (based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT.
Huh.  That seems surprisingly slow, even for valgrind.  I guess it's one more
example of why the single-threaded archiving approach sucks so badly :)
> I can reproduce this if I test with valgrind --trace-children=yes.  With my
> normal valgrind settings, the whole test file takes only 18s.  I recommend
> one of these changes to skink:
> 
> - Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
>   "sh" and "cp" commands.
> - Remove --trace-children=yes
Hm. I think I used --trace-children=yes because I was thinking it was required
to track forks. But a newer version of valgrind's man page has an important
clarification:
       --trace-children=<yes|no> [default: no]
           When enabled, Valgrind will trace into sub-processes initiated via the exec system call. This is necessary for multi-process programs.
 
           Note that Valgrind does trace into the child of a fork (it would be difficult not to, since fork makes an identical copy of a process), so this
           option is arguably badly named. However, most children of fork calls immediately call exec anyway.
So there doesn't seem to be much point in using --trace-children=yes.
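To make the two suggested options concrete, here is a rough sketch of what the
command lines could look like.  The base options below are an assumption for
illustration only (skink's actual configuration isn't shown in this thread);
the --trace-children and --trace-children-skip flags are the ones discussed
above.  The script only prints the command lines, it does not run valgrind.

```shell
# Hypothetical base options; skink's real settings are not shown in this thread.
VALGRIND_BASE="valgrind --quiet --error-exitcode=1"

# Option 1: keep following exec'd children, but skip system binaries such as
# "sh" and "cp" (the patterns are the ones from Noah's suggestion).
VALGRIND_SKIP="$VALGRIND_BASE --trace-children=yes --trace-children-skip=/bin/*,/usr/bin/*"

# Option 2: don't follow exec at all.  Per the man page, children of fork that
# never call exec are still traced.
VALGRIND_NOEXEC="$VALGRIND_BASE --trace-children=no"

# Print the resulting command lines instead of running them.
printf '%s\n' "$VALGRIND_SKIP postgres" "$VALGRIND_NOEXEC postgres"
```

When typing option 1 directly at a shell prompt, the skip patterns need to be
quoted (as in Noah's mail) so the shell doesn't expand the globs.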
> Andres, what do you think about making one of those skink configuration
> changes?  Alternatively, I could make the test poll until archiving catches
> up.  However, that would take skink about 30min, and I expect little value
> from 30min of valgrind instrumenting the "cp" command.
I just changed the config to --trace-children=no. A valgrind run is already
in progress, so the change won't take effect until the run after that one.
Greetings,
Andres Freund