Re: Minimal logical decoding on standbys

From: Andres Freund <andres(at)anarazel(dot)de>
To: "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>
Cc: fabriziomello(at)gmail(dot)com, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, "[pgdg] Robert Haas" <robertmhaas(at)gmail(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, Rahila Syed <rahila(dot)syed(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minimal logical decoding on standbys
Date: 2021-04-07 17:09:54
Message-ID: 20210407170954.xbovgh3xdrnogg7w@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I think I'll remove the "Catalog xmins should advance after standby
logical slot fetches the changes." test. For one, it takes a long time
(due to the 2000 psqls). But more importantly, it's simply not testing
anything that's reliable:

1) There's no guarantee that I can see that catalog_xmin needs to
increase, just because we called pg_logical_slot_get_changes() once.

2) $node_master->wait_for_catchup($node_standby, 'replay',
$node_master->lsn('flush')); is definitely not OK. It only happens to
work by accident / the 2000 iterations. There might not be any logical
changes associated with that LSN, so there'd might not be anything to
replay. That's especially true for the second wait_for_catchup - there
haven't been any logical changes since the last wait.

The test hangs reliably for me if I replace the 2000 with 2. Kinda looks
like somebody just tried to add more and more inserts to make the test
not fail, without addressing the reliability issues. That kind of thing
rarely works out well, because it tends to be very machine specific to
get the timing right. And it makes the test take forever.

TBH, most of 024_standby_logical_decoding_xmins.pl looks like they've
been minimally hacked up the tests from Craig's quite different patch,
without adjusting them. There's stuff like:

# Create new slots on the standby, ignoring the ones on the master completely.
#
# This must succeed since we know we have a catalog_xmin reservation. We
# might've already sent hot standby feedback to advance our physical slot's
# catalog_xmin but not received the corresponding xlog for the catalog xmin
# advance, in which case we'll create a slot that isn't usable. The calling
# application can prevent this by creating a temporary slot on the master to
# lock in its catalog_xmin. For a truly race-free solution we'd need
# master-to-standby hot_standby_feedback replies.
#
# In this case it won't race because there's no concurrent activity on the
# master.
#

This issue doesn't exist in the patch.

There's also no test for a recovery conflict due to row removal. Despite
that being a substantial part of the patchset.

I'm tempted to throw out 024 - all of its tests seem fragile and prove
little. And then add a few more tests to 025 (and renaming it).

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Yen 2021-04-07 17:13:30 dump cannot be restored if schema permissions revoked
Previous Message Tom Lane 2021-04-07 17:07:59 Re: buildfarm instance bichir stuck