Re: [PoC] pg_upgrade: allow to upgrade publisher node

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-09-25 07:36:16
Message-ID: CALj2ACXp+LXioY_=9mboEbLD--4c4nnpJCZ+j4fckBdSOQhENA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Sep 23, 2023 at 10:18 AM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Again, thank you for reviewing! Here is a new version patch.

Here are some more comments/thoughts on the v44 patch:

1.
+# pg_upgrade will fail because the slot still has unconsumed WAL records
+command_fails(
+ [

Add a test case to hit fprintf(script, "The slot \"%s\" is invalid\n",
file as well?

2.
+ 'run of pg_upgrade where the new cluster has insufficient
max_replication_slots');
+ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d",
+ "pg_upgrade_output.d/ not removed after pg_upgrade failure");

+ 'run of pg_upgrade where the new cluster has the wrong wal_level');
+ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d",
+ "pg_upgrade_output.d/ not removed after pg_upgrade failure");

+ 'run of pg_upgrade of old cluster with idle replication slots');
+ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d",
+ "pg_upgrade_output.d/ not removed after pg_upgrade failure");

How do these tests recognize the failures are the intended ones? I
mean, for instance when pg_upgrade fails for unused replication
slots/unconsumed WAL records, then just looking at the presence of
pg_upgrade_output.d might not be sufficient, no? Using
command_fails_like instead of command_fails and looking at the
contents of invalid_logical_relication_slots.txt might help make these
tests more focused.

3.
+ pg_log(PG_REPORT, "fatal");
+ pg_fatal("Your installation contains invalid logical
replication slots.\n"
+ "These slots can't be copied, so this cluster cannot
be upgraded.\n"
+ "Consider removing such slots or consuming the
pending WAL if any,\n"
+ "and then restart the upgrade.\n"
+ "A list of invalid logical replication slots is in
the file:\n"
+ " %s", output_path);

It's not just the invalid logical replication slots, but also the
slots with unconsumed WALs which aren't invalid and can be upgraded if
ensured the WAL is consumed. So, a better wording would be:
pg_fatal("Your installation contains logical replication slots
that cannot be upgraded.\n"
"List of all such logical replication slots is in the file:\n"
"These slots can't be copied, so this cluster cannot
be upgraded.\n"
"Consider removing invalid slots and/or consuming the
pending WAL if any,\n"
"and then restart the upgrade.\n"
" %s", output_path);

4.
+ /*
+ * There is a possibility that following records may be generated
+ * during the upgrade.
+ */
+ is_valid = is_xlog_record_type(rmid, info, RM_XLOG_ID,
XLOG_CHECKPOINT_SHUTDOWN) ||
+ is_xlog_record_type(rmid, info, RM_XLOG_ID,
XLOG_CHECKPOINT_ONLINE) ||
+ is_xlog_record_type(rmid, info, RM_XLOG_ID, XLOG_SWITCH) ||
+ is_xlog_record_type(rmid, info, RM_XLOG_ID, XLOG_FPI_FOR_HINT) ||
+ is_xlog_record_type(rmid, info, RM_XLOG_ID,
XLOG_PARAMETER_CHANGE) ||
+ is_xlog_record_type(rmid, info, RM_STANDBY_ID,
XLOG_RUNNING_XACTS) ||
+ is_xlog_record_type(rmid, info, RM_HEAP2_ID, XLOG_HEAP2_PRUNE);

What if we missed to capture the WAL records that may be generated
during upgrade?

What happens if a custom WAL resource manager generates table/index AM
WAL records during upgrade?

What happens if new WAL records are added that may be generated during
the upgrade? Isn't keeping this code extensible and in sync with
future changes a problem? Or we'd better say that any custom WAL
records are found after the slot's confirmed flush LSN, then the slot
isn't upgraded?

5. In continuation to the above comment:

Why can't this logic be something like - if there's any WAL record
seen after a slot's confirmed flush LSN is of type generated by WAL
resource manager having the rm_decode function defined, then the slot
can't be upgraded.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2023-09-25 07:50:05 Re: GUC for temporarily disabling event triggers
Previous Message vignesh C 2023-09-25 07:35:03 Re: Invalidate the subscription worker in cases where a user loses their superuser status