Still another race condition in recovery TAP tests

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Still another race condition in recovery TAP tests
Date: 2017-09-09 02:32:03
Message-ID: 24621.1504924323@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

In a moment of idleness I tried to run the TAP tests on pademelon,
which is a mighty old and slow machine. Behold,
src/test/recovery/t/010_logical_decoding_timelines.pl fell over,
with the relevant section of its log contents being:

# testing logical timeline following with a filesystem-level copy
# Taking filesystem backup b1 from node "master"
# pg_start_backup: 0/2000028
could not opendir(/home/postgres/pgsql/src/test/recovery/tmp_check/t_010_logical_decoding_timelines_master_data/pgdata/pg_wal/archive_status/000000010000000000000001.ready): No such file or directory at ../../../src/test/perl//RecursiveCopy.pm line 115.
### Stopping node "master" using mode immediate

The postmaster log has this relevant entry:

2017-09-08 22:03:22.917 EDT [19160] DEBUG: archived write-ahead log file "000000010000000000000001"

It looks to me like the archiver removed 000000010000000000000001.ready
between where RecursiveCopy.pm checks that $srcpath is a regular file
or directory (line 95) and where it rechecks whether it's a regular
file (line 105). Then the "-f" test on line 105 fails, allowing it to
fall through to the its-a-directory path, and unsurprisingly the opendir
at line 115 fails with the above symptom.

In short, RecursiveCopy.pm is woefully unprepared for the idea that the
source directory tree might be changing underneath it.

I'm not real sure if the appropriate answer to this is "we need to fix
RecursiveCopy" or "we need to fix the callers to not call RecursiveCopy
until the source directory is stable". Thoughts?

(I do kinda wonder why we rolled our own RecursiveCopy; surely there's
a better implementation in CPAN?)

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2017-09-09 02:44:32 Re: [POC] Faster processing at Gather node
Previous Message Amit Kapila 2017-09-09 02:26:39 Re: Parallel worker error