Re: Some thoughts about the TAP tests' wait_for_catchup()

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Some thoughts about the TAP tests' wait_for_catchup()
Date: 2021-09-29 11:52:39
Message-ID: CAA4eK1+_9kJ_tiEy7k7Ks--2p01DAA5t53byH96T1DQ3JBNaZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 29, 2021 at 3:47 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I noticed that some test scripts, instead of using wait_for_catchup
> to wait for replication catchup, use ad-hoc code like
>
> my $primary_lsn =
> $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
> $standby->poll_query_until('postgres',
> qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
> or die "standby never caught up";
>
> This does not look much like what wait_for_catchup does, which
> typically ends up issuing queries like
>
> SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming'
> FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby';
>
> It seems to me that for most purposes wait_for_catchup's approach is
> strictly worse, for two reasons:
>
> 1. It continually recomputes the primary's pg_current_wal_lsn().
> Thus, if anything is happening on the primary (e.g. autovacuum),
> we're moving the goalposts as to how much the standby is required
> to replay before we continue. That slows down the tests, makes
> them less reproducible, and could help to mask actual bugs of the
> form this-hasn't-been-done-when-it-should-have.
>
> 2. It's querying the primary's view of the standby's state, which
> introduces a reporting delay. This has exactly the same three
> problems as the other point.
>

I can't comment on all the use cases of wait_for_catchup() but I think
there are some use cases in logical replication where we need the
publisher to use wait_for_catchup after setting up the replication to
ensure that wal sender is started and in-proper state by checking its
state (which should be 'streaming'). That also implicitly checks if
the wal receiver has responded to initial ping requests by sending
replay location. The typical use is as below where after setting up
initial replication we wait for a publisher to catch up and then check
if the initial table sync is finished. There is no use in checking the
second till the first statement is completed.

subscription/t/001_rep_changes.pl
...
$node_publisher->wait_for_catchup('tap_sub');

# Also wait for initial table sync to finish
my $synced_query =
"SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
IN ('r', 's');";
$node_subscriber->poll_query_until('postgres', $synced_query)
or die "Timed out while waiting for subscriber to synchronize data";

I am not sure in such tests if checking solely the subscriber's
wal_replay_lsn would be sufficient. So, I think there are cases
especially in physical replication tests where we can avoid using
wait_for_catchup but I am not sure if we can completely eliminate it.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-09-29 11:55:07 Re: Failed transaction statistics to measure the logical replication progress
Previous Message Drouvot, Bertrand 2021-09-29 11:36:14 Re: [BUG] failed assertion in EnsurePortalSnapshotExists()