Re: Still another race condition in recovery TAP tests

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Still another race condition in recovery TAP tests
Date: 2017-09-09 04:18:01
Message-ID: CAB7nPqQyEmbrY-Omafz=n31O77gEsd2hww19QUkpnA=-M-mBGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Sat, Sep 9, 2017 at 11:32 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> In a moment of idleness I tried to run the TAP tests on pademelon,
> which is a mighty old and slow machine.

How long did it take? Just wondering if that's actually the slowest
one or not to run the full set of recovery tests. This would be mighty
slow. Even hamster never detected this failure I think.

> It looks to me like the archiver removed 000000010000000000000001.ready
> between where RecursiveCopy.pm checks that $srcpath is a regular file
> or directory (line 95) and where it rechecks whether it's a regular
> file (line 105). Then the "-f" test on line 105 fails, allowing it to
> fall through to the its-a-directory path, and unsurprisingly the opendir
> at line 115 fails with the above symptom.
>
> In short, RecursiveCopy.pm is woefully unprepared for the idea that the
> source directory tree might be changing underneath it.

Yes, before 010_logical_decoding_timelines and _backup_fs were
introduced, we only used RecursiveCopy with a static source in
init_from_backup(), which represented no risks.

> I'm not real sure if the appropriate answer to this is "we need to fix
> RecursiveCopy" or "we need to fix the callers to not call RecursiveCopy
> until the source directory is stable". Thoughts?

We could make RecursiveCopy smarter here. In backup_fs_hot, if for
example another segment switch happens between pg_start_backup() and
CopyRecursive, say with a low archive_timeout with a TAP test using
this configuration, then we would stay at the same point. Another
thing that we could actually use here is the fact that _backup_fs
assumes that archives must be enabled (a sanity check would be nice!),
so we could just exclude pg_wal from the FS-backup and avoid any
problems, at least for this test butthat won't save from any other
paths getting modified on disk. So making RecursiveCopy really looks
like the best bet in the long term.

> (I do kinda wonder why we rolled our own RecursiveCopy; surely there's
> a better implementation in CPAN?)

This comes from here, which is something I got involved in:
commit: 1caef31d9e550408d0cbc5788a422dcb69736df5
author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
date: Wed, 2 Dec 2015 18:46:16 -0300
Refactor Perl test code

So the idea here is really to have:
1) Something compatible down to perl 5.8.
2) Something which is not external, which is where we could have used
File::Copy::Recursive (only compatible from 5.12 if I recall
correctly? I am too lazy to check now). Note as well that a copy of
the whole directory is used for backups instead of tar format of
pg_basebackup because there is no easy support for tar files down to
5.8.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-09-09 04:43:45 Re: Still another race condition in recovery TAP tests
Previous Message Thomas Munro 2017-09-09 04:05:57 Re: Patch: add --if-exists to pg_recvlogical