Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Mark Dilger <hornschnorter(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date: 2017-04-23 02:44:28
Message-ID: 3177FAD5-3600-4C8B-B338-FB108F60455F@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers


> On Apr 22, 2017, at 11:40 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I wrote:
>> Whoa. This just turned into a much larger can of worms than I expected.
>> How can it be that processes are getting assertion crashes and yet the
>> test framework reports success anyway? That's impossibly
>> broken/unacceptable.
>
> I poked into this on my laptop, where I'm able to reproduce both assertion
> failures. The reason the one in subtrans.c is not being detected by the
> test framework is that it happens late in the run and the test doesn't
> do anything more with that server than shut it down. "pg_ctl stop"
> actually notices there's a problem; for instance, the log on tern for
> this step shows
>
> ### Stopping node "master" using mode immediate
> # Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata -m immediate stop
> pg_ctl: PID file "/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid" does not exist
> Is server running?
> # No postmaster PID
>
> However, PostgresNode::stop blithely ignores the failure exit from
> pg_ctl. I think it should use system_or_bail() not just system_log(),
> so that we'll notice if the server is not running when we expect it
> to be. It's conceivable that there should also be a "stop_if_running"
> method that allows the case where the server is not running, but so
> far as I can find, no existing TAP test needs such a behavior --- and
> it certainly shouldn't be the default.
>
> The walsender.c crash is harder to detect because the postmaster very
> nicely recovers and restarts its children. That's great for robustness,
> but it sucks for testing. I think that we ought to disable that in
> TAP tests, which we can do by having PostgresNode::init include
> "restart_after_crash = off" in the postgresql.conf modifications it
> applies. Again, it's conceivable that some tests would not want that,
> but there is no existing test that seems to need crash recovery on,
> and I don't see a good argument for it being default for test purposes.
>
> As best I've been able to tell, the reason why the walsender.c crash
> is detected when it's detected is that the TAP script chances to try
> to connect while the recovery is happening:
>
> connection error: 'psql: FATAL: the database system is in recovery mode'
> while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq dbname='postgres' -f - -v ON_ERROR_STOP=1' at /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1173.
>
> The window for that is not terribly wide, explaining why the detection
> is unreliable even when the crash does occur.
>
> In short then, I propose the attached patch to make these cases fail
> more reliably. We might extend this later to allow the old behaviors
> to be explicitly opted-into, but we don't seem to need that today.
>
> regards, tom lane

I pulled fresh sources with your latest commit, 7d68f2281a4b56834c8e5648fc7da0b73b674c45,
and the tests consistently fail (5 out of 5 attempts) for me on my laptop with:

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C recovery check
for extra in contrib/test_decoding; do /Applications/Xcode.app/Contents/Developer/usr/bin/make -C '../../..'/$extra DESTDIR='/Users/mark/vanilla/bitoctetlength'/tmp_install install >>'/Users/mark/vanilla/bitoctetlength'/tmp_install/log/install.log || exit; done
rm -rf /Users/mark/vanilla/bitoctetlength/src/test/recovery/tmp_check/log
cd . && TESTDIR='/Users/mark/vanilla/bitoctetlength/src/test/recovery' PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/bin:$PATH" DYLD_LIBRARY_PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/lib" PGPORT='65432' PG_REGRESS='/Users/mark/vanilla/bitoctetlength/src/test/recovery/../../../src/test/regress/pg_regress' prove -I ../../../src/test/perl/ -I . t/*.pl
t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
t/004_timeline_switch.pl ............. Bailout called. Further testing stopped: system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make[2]: *** [check] Error 255
make[1]: *** [check-recovery-recurse] Error 2
make: *** [check-world-src/test-recurse] Error 2

The first time, I had some local changes, but I've stashed them and am still getting the error each of
these next four times. I'm compiling as follows:

make distclean; make clean; ./configure --enable-cassert --enable-tap-tests --enable-depend && make -j4 && make check-world

Are the errors expected now?

mark

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Tom Lane 2017-04-23 05:26:46 Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders
Previous Message Thomas Munro 2017-04-23 02:11:40 Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-04-23 05:26:46 Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders
Previous Message Bruce Momjian 2017-04-23 02:37:06 Statistics "dependency"