Re: pgsql: Allow using the updated tuple while moving it to a different par

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Amit Kapila <akapila(at)postgresql(dot)org>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Allow using the updated tuple while moving it to a different par
Date: 2018-07-12 21:57:14
Message-ID: 11887.1531432634@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
> On Thu, Jul 12, 2018 at 1:29 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> Looks unrelated, but flaviventris (a snake?) has just complained:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2018-07-12%2007%3A36%3A01

> Yeah, I am confused as to how this commit can lead to that failure.

It didn't; looks like just a random failure to me. The test is timing
out:

ok 26 - catalog xmin of cascaded slot still null with hs_feedback reset
# re-enabling hot_standby_feedback and disabling while stopped
### Reloading node "standby_2"
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata reload
server signaled
Waiting for replication conn standby_1's replay_lsn to pass '0/30A1720' on master
done
Waiting for replication conn standby_2's replay_lsn to pass '0/30A1720' on standby_1
done
### Stopping node "standby_2" using mode fast
# Running: pg_ctl -D /home/bf/build/buildfarm-flaviventris/HEAD/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdata -m fast stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "standby_2"
#
Timed out waiting for slot xmins to advance at t/001_stream_rep.pl line 157.

This looks to me like it's probably a low-probability timing problem in
the test script itself. The relevant part of the script is

$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
replay_check();

$node_standby_2->safe_psql('postgres',
'ALTER SYSTEM SET hot_standby_feedback = off;');
$node_standby_2->stop;

($xmin, $catalog_xmin) =
get_slot_xmins($node_standby_1, $slotname_2, "xmin IS NOT NULL");
isnt($xmin, '', 'xmin of cascaded slot non-null with postgres shut down');

where the failure is occurring because get_slot_xmins() never sees the
expected state. I'm suspicious that this is happening because the script
is making no attempt to ensure the relative timing of events on the three
servers --- notably, issuing a "reload" does nothing to guarantee that
the server has actually seen and responded to the parameter change.

As an extreme case, I can reproduce the failure exactly with this:

diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index a0d3e8f..0824184 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -286,7 +286,7 @@ is($catalog_xmin, '',
note "re-enabling hot_standby_feedback and disabling while stopped";
$node_standby_2->safe_psql('postgres',
'ALTER SYSTEM SET hot_standby_feedback = on;');
-$node_standby_2->reload;
+# $node_standby_2->reload;

$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
replay_check();

so I'm thinking that what we're seeing is an effect of the second standby
sometimes being too slow to turn on hot_standby_feedback.

We could maybe hack around that with a short "sleep" in this script, but
a less cruddy solution would be to change PostgresNode::reload so that
it somehow waits for the server to process the reload signal.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2018-07-12 22:03:39 Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
Previous Message Michael Paquier 2018-07-12 21:45:35 Re: Temporary WAL segments files not cleaned up after an instance crash

Browse pgsql-committers by date

  From Date Subject
Next Message Michael Paquier 2018-07-13 00:48:39 pgsql: Fix argument of pg_create_logical_replication_slot for slot name
Previous Message Michael Paquier 2018-07-12 21:44:09 pgsql: Clean up temporary WAL segments after an instance crash