Timing-sensitive case in src/test/recovery TAP tests

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Timing-sensitive case in src/test/recovery TAP tests
Date: 2017-06-25 21:10:57
Message-ID: 8962.1498425057@sss.pgh.pa.us
Lists: pgsql-hackers

I've been experimenting with a change to pg_ctl, which I'll post
separately, to reduce its reaction time so that it reports success
more quickly after a wait for postmaster start/stop. I found one
case in "make check-world" that failed when I reduced the
reaction time to ~1ms. That's the very last test in 001_stream_rep.pl,
"cascaded slot xmin reset after startup with hs feedback reset", and
the cause appears to be that it's not allowing any delay time for a
replication slot's state to update after a postmaster restart.

This seems worth fixing independently of any possible code changes,
because it shows that this test could fail on a slow or overloaded
machine. I couldn't find any instances of such a failure in the
buildfarm archives, but that may have a lot to do with the fact that
owners of slow buildfarm animals are (mostly?) not running this test.

Some experimentation says that the minimum delay needed to make it
work reliably on my workstation is about 100ms, so a simple patch
along the lines of the attached might be good enough. I find this
approach conceptually dissatisfying, though, since it's still
potentially vulnerable to the failure under sufficient load.
I wonder if there is an easy way to improve that ... maybe convert
to something involving poll_query_until?
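The difference between a fixed delay and a polling approach can be sketched in Python. This is only an illustration of the idea, not PostgresNode's poll_query_until. `poll_until`, `FakeSlot`, the helper functions and all delay values are hypothetical, and `FakeSlot` merely stands in for querying the replication slot's state after a restart. A fixed sleep fails whenever the state change takes longer than the chosen delay, as it can on a loaded machine. Polling re-checks until the condition holds and gives up only after a generous timeout, so it is fast on an idle machine and still tolerant of a slow one.

```python
import threading
import time
from typing import Callable


def poll_until(condition: Callable[[], bool],
               timeout: float = 180.0,
               interval: float = 0.1) -> bool:
    """Re-evaluate `condition` every `interval` seconds until it returns
    True or `timeout` seconds have elapsed; report whether it succeeded."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeSlot:
    """Hypothetical stand-in for a replication slot whose xmin is
    cleared asynchronously some time after a server restart."""

    def __init__(self, reset_delay: float):
        self.xmin = 1234  # stale value still visible right after restart
        timer = threading.Timer(reset_delay, self._reset)
        timer.daemon = True
        timer.start()

    def _reset(self) -> None:
        self.xmin = None


def check_after_fixed_sleep(reset_delay: float, sleep: float) -> bool:
    # Fixed-delay approach: correct only if `sleep` exceeds the reset delay.
    slot = FakeSlot(reset_delay)
    time.sleep(sleep)
    return slot.xmin is None


def check_with_polling(reset_delay: float, timeout: float = 5.0) -> bool:
    # Polling approach: waits exactly as long as needed, up to `timeout`.
    slot = FakeSlot(reset_delay)
    return poll_until(lambda: slot.xmin is None, timeout=timeout)
```

With a reset that takes 0.5 s, `check_after_fixed_sleep(0.5, 0.05)` sees the stale xmin and fails, while `check_with_polling(0.5)` succeeds after roughly half a second. The timeout keeps a genuinely broken case from hanging the test run forever.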

regards, tom lane

Attachment: add-delay-in-time-sensitive-test.patch (text/x-diff, 611 bytes)
