Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date: 2017-04-22 18:40:28
Message-ID: 21432.1492886428@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

I wrote:
> Whoa. This just turned into a much larger can of worms than I expected.
> How can it be that processes are getting assertion crashes and yet the
> test framework reports success anyway? That's impossibly
> broken/unacceptable.

I poked into this on my laptop, where I'm able to reproduce both assertion
failures. The reason the one in subtrans.c is not being detected by the
test framework is that it happens late in the run and the test doesn't
do anything more with that server than shut it down. "pg_ctl stop"
actually notices there's a problem; for instance, the log on tern for
this step shows

### Stopping node "master" using mode immediate
# Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata -m immediate stop
pg_ctl: PID file "/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid" does not exist
Is server running?
# No postmaster PID

However, PostgresNode::stop blithely ignores the failure exit from
pg_ctl. I think it should use system_or_bail() not just system_log(),
so that we'll notice if the server is not running when we expect it
to be. It's conceivable that there should also be a "stop_if_running"
method that allows the case where the server is not running, but so
far as I can find, no existing TAP test needs such a behavior --- and
it certainly shouldn't be the default.

The walsender.c crash is harder to detect because the postmaster very
nicely recovers and restarts its children. That's great for robustness,
but it sucks for testing. I think that we ought to disable that in
TAP tests, which we can do by having PostgresNode::init include
"restart_after_crash = off" in the postgresql.conf modifications it
applies. Again, it's conceivable that some tests would not want that,
but there is no existing test that seems to need crash recovery on,
and I don't see a good argument for it being default for test purposes.

As best I've been able to tell, the reason why the walsender.c crash
is detected when it's detected is that the TAP script chances to try
to connect while the recovery is happening:

connection error: 'psql: FATAL: the database system is in recovery mode'
while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq dbname='postgres' -f - -v ON_ERROR_STOP=1' at /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1173.

The window for that is not terribly wide, explaining why the detection
is unreliable even when the crash does occur.

In short then, I propose the attached patch to make these cases fail
more reliably. We might extend this later to allow the old behaviors
to be explicitly opted-into, but we don't seem to need that today.

regards, tom lane

Attachment Content-Type Size
detect-server-crashes-in-tap-tests.patch text/x-diff 861 bytes

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Tom Lane 2017-04-22 20:58:23 pgsql: Make PostgresNode::append_conf append a newline automatically.
Previous Message Tom Lane 2017-04-22 16:37:50 Re: [COMMITTERS] pgsql: Replication lag tracking for walsenders

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2017-04-22 19:08:18 Re: valgrind error in subscription code
Previous Message Andres Freund 2017-04-22 18:31:23 valgrind error in subscription code