Re: backup manifests and contemporaneous buildfarm failures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Noah Misch <noah(at)leadboat(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Suraj Kharage <suraj(dot)kharage(at)enterprisedb(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Tels <nospam-pg-abuse(at)bloodgate(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: backup manifests and contemporaneous buildfarm failures
Date: 2020-04-03 22:48:01
Message-ID: 26044.1585954081@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> Same here, on elver. I see pg_subtrans has been chmod(0)'d,
> presumably by the perl subroutine mutilate_open_directory_fails. I
> see this in my inbox (the build farm wrote it to stderr or stdout
> rather than the log file):

> cannot chdir to child for
> pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails/pg_subtrans:
> Permission denied at ./run_build.pl line 1013.
> cannot remove directory for
> pgsql.build/src/bin/pg_validatebackup/tmp_check/t_003_corruption_master_data/backup/open_directory_fails:
> Directory not empty at ./run_build.pl line 1013.

I'm guessing that we're looking at a platform-specific difference in
whether "rm -rf" fails outright on an unreadable subdirectory, or
just tries to carry on by unlinking it anyway.

A partial fix would be to have the test script put back normal
permissions on that directory before it exits ... but any failure
partway through the script would leave a time bomb requiring manual
cleanup.

On the whole, I'd argue that testing that behavior is not valuable
enough to take risks of periodically breaking buildfarm members
in a way that will require manual recovery --- to say nothing of
annoying developers who trip over it. So my vote is to remove
that part of the test and be satisfied with checking the behavior
for an unreadable file.

This doesn't directly explain the failure-at-next-configure behavior
that we're seeing in the buildfarm, but it wouldn't be too surprising
if it ends up being that the buildfarm client script doesn't manage
to fully recover from the situation.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-04-03 22:53:40 vacuum_defer_cleanup_age inconsistently applied on replicas
Previous Message Stephen Frost 2020-04-03 22:39:41 Re: backup manifests and contemporaneous buildfarm failures