Recovery test failure for recovery_min_apply_delay on hamster

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Recovery test failure for recovery_min_apply_delay on hamster
Date: 2016-03-02 05:04:23
Message-ID: CAB7nPqSAZ9HnUcMoUa30JO2wJ8MnREm18p2a7McRA-ZrJxj3Vw@mail.gmail.com
Lists: pgsql-hackers

Hi all,

I enabled the recovery test suite on hamster yesterday, and we did not
have to wait long before seeing the first failure on it. The machine is
slow as hell, which makes it quite good at catching race conditions:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-03-01%2016%3A00%3A06
Honestly, I had already run the test suite on this machine without
seeing this failure, so it is quite sporadic. Yesterday's run worked
fine, for example.

In more detail, the following problem showed up:
### Running SQL command on node "standby": SELECT count(*) FROM tab_int
not ok 1 - check content with delay of 1s

# Failed test 'check content with delay of 1s'
# at t/005_replay_delay.pl line 39.
# got: '20'
# expected: '10'
### Running SQL command on node "master": SELECT pg_current_xlog_location();
### Running SQL command on node "standby": SELECT count(*) FROM tab_int
ok 2 - check content with delay of 2s

This is a timing issue caused by the use of recovery_min_apply_delay.
The test does the following:
1) Set recovery_min_apply_delay to 2 seconds
2) Start the standby
3) Apply an INSERT on master, save pg_current_xlog_location from master
4) Sleep 1s
5) Query the standby, and check that the WAL has not been applied yet.
6) Wait until the required LSN from master has been applied
7) Query the standby again, and check that the WAL has been applied.

The problem is that hamster is evidently so slow that more than 2s
elapsed between steps 3 and 5, meaning that the delay had already been
reached and the WAL had already been applied.
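
To make the race concrete, here is a minimal sketch in Python of what the
first check observes. It is only a model, not the actual TAP test: the
function name and parameters are made up for illustration, and the row
counts (10 rows before the INSERT, 10 more after) are inferred from the
"got: '20'" / "expected: '10'" output above.

```python
def rows_seen_on_standby(elapsed_s, apply_delay_s, rows_before=10, rows_inserted=10):
    """Rows a standby query returns `elapsed_s` seconds after the INSERT on master.

    Simplified model (an assumption): the INSERT becomes visible on the
    standby as soon as recovery_min_apply_delay has elapsed.
    """
    if elapsed_s >= apply_delay_s:
        return rows_before + rows_inserted
    return rows_before


# Fast machine: the standby is queried about 1s after the INSERT,
# well within the 2s apply delay, so the new rows are not visible yet.
print(rows_seen_on_standby(elapsed_s=1.0, apply_delay_s=2.0))

# Slow machine: more than 2s have passed by the time the query runs,
# so the delayed WAL has already been applied.
print(rows_seen_on_standby(elapsed_s=2.5, apply_delay_s=2.0))
```

The first call models the expected behavior and the second one models
what happened on hamster.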

Here are a few ways to address this problem:
1) Remove the check made before the delay has elapsed (step 5)
2) Increase recovery_min_apply_delay to a value that lets even slow
machines see a difference. From experience with the other tests, 30s
would be enough. The sleep time needs to be increased as well, which
makes the test take longer to run
3) Remove 005 altogether, because doing either 1) or 2) reduces the
value of the test.
I'd go with 1) personally; I still see value in this test.
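
As a rough illustration of the trade-off in 2), the early check only
passes while the sleep plus whatever overhead the machine adds stays
below the apply delay. The sketch below is plain arithmetic in Python,
not test code; the helper names are hypothetical, and the 15s sleep is
an arbitrary value picked for the example (only the 30s delay is
suggested above).

```python
def tolerated_overhead(apply_delay_s, sleep_s):
    """Extra seconds a machine may lose before the early check starts failing."""
    return apply_delay_s - sleep_s


def early_check_passes(overhead_s, apply_delay_s, sleep_s):
    """True if the standby is still queried before the apply delay expires."""
    return sleep_s + overhead_s < apply_delay_s


# Current settings: 2s delay, 1s sleep.
print(tolerated_overhead(apply_delay_s=2.0, sleep_s=1.0))

# Option 2 with a 30s delay and a hypothetical 15s sleep.
print(tolerated_overhead(apply_delay_s=30.0, sleep_s=15.0))

# A machine that loses 1.5s fails with the current settings
# but passes with the larger ones.
print(early_check_passes(1.5, apply_delay_s=2.0, sleep_s=1.0))
print(early_check_passes(1.5, apply_delay_s=30.0, sleep_s=15.0))
```

With 1), there is no early check left, so the result no longer depends
on this margin at all.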

Thoughts?
--
Michael
