Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Pang <robertpang(at)google(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
Date: 2025-04-20 21:53:39
Message-ID: 20250420215339.e8.nmisch@google.com
Lists: pgsql-hackers

On Mon, Apr 14, 2025 at 09:19:35AM +0900, Michael Paquier wrote:
> On Sun, Apr 13, 2025 at 11:51:57AM -0400, Tom Lane wrote:
> > Noah Misch <noah(at)leadboat(dot)com> writes:
> > > Tom and Michael, do you still object to the test addition, or not? If there
> > > are no new or renewed objections by 2025-04-20, I'll proceed to add the test.

Pushed as commit 714bd9e. The only failure so far is
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-20%2015%3A36%3A35
with these highlights:

pg_ctl: server does not shut down

2025-04-20 17:27:35.735 UTC [1576688][postmaster][:0] LOG: received immediate shutdown request
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] FATAL: archive command was terminated by signal 3: Quit
2025-04-20 17:27:35.969 UTC [1577386][archiver][:0] DETAIL: The failed archive command was: cp "pg_wal/00000001000000000000006D" "/home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/045_archive_restartpoint/data/t_045_archive_restartpoint_primary_data/archives/00000001000000000000006D"

The checkpoints and WAL creation took 30s, but archiving was only 20% done
(based on file name 00000001000000000000006D) at the 360s PGCTLTIMEOUT. I can
reproduce this if I test with valgrind --trace-children=yes. With my normal
valgrind settings, the whole test file takes only 18s. I recommend one of
these changes to skink:

- Add --trace-children-skip='/bin/*,/usr/bin/*' so valgrind doesn't instrument
"sh" and "cp" commands.
- Remove --trace-children=yes
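
To illustrate what the first option would and would not exclude, here is a
rough sketch that approximates the pattern matching with Python's fnmatch.
This is an assumption for illustration only: valgrind does its own matching
of "?" and "*", and the three executable paths below are hypothetical, not
taken from skink's configuration.

```python
from fnmatch import fnmatchcase

# Patterns from the first suggestion above.
SKIP_PATTERNS = ["/bin/*", "/usr/bin/*"]


def is_skipped(child_path: str, patterns=SKIP_PATTERNS) -> bool:
    """Return True if any skip pattern matches the child's executable path.

    fnmatchcase is only an approximation of valgrind's matcher (it also
    accepts [seq] character classes, which are not used here).
    """
    return any(fnmatchcase(child_path, p) for p in patterns)


# Hypothetical install locations, chosen only for illustration.
for path in ("/bin/sh", "/usr/bin/cp", "/usr/local/pgsql/bin/postgres"):
    print(path, "skipped" if is_skipped(path) else "traced")
```

Under these assumptions, the shell and "cp" spawned by archive_command would
run uninstrumented, while a postgres binary installed outside /bin and
/usr/bin would still be traced.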

Andres, what do you think about making one of those skink configuration
changes? Alternatively, I could make the test poll until archiving catches
up. However, that would take skink about 30min, and I expect little value
from 30min of valgrind instrumenting the "cp" command.
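
For reference, the 30min figure is consistent with a linear extrapolation of
the numbers above (20% archived at 360s), and the polling alternative could
look roughly like the sketch below. This is not the test's actual code, which
is a Perl TAP script. The function names are hypothetical, and last_archived
stands in for whatever reports archiver progress, for example a query of
pg_stat_archiver.last_archived_wal. The comparison assumes ordinary WAL
segment names on a single timeline, which sort lexicographically.

```python
import time
from typing import Callable


def estimated_total_seconds(elapsed_s: float, percent_done: float) -> float:
    """Linear extrapolation; assumes archiving proceeds at a constant rate."""
    return elapsed_s * 100 / percent_done


def wait_for_archive(
    last_archived: Callable[[], str],
    target_wal: str,
    timeout_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until last_archived() reaches target_wal, or give up at timeout.

    last_archived is a hypothetical callable returning the most recently
    archived WAL file name, or "" if nothing has been archived yet.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Fixed-width uppercase hex names compare correctly as strings.
        if last_archived() >= target_wal:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# 20% done at the 360s timeout extrapolates to 1800s, i.e. 30 minutes.
print(estimated_total_seconds(360, 20) / 60, "minutes")
```

The timeout passed to such a loop would need to exceed the extrapolated
runtime on the slowest animal, which is the cost being weighed above.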
