RE: [PoC] pg_upgrade: allow to upgrade publisher node

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Peter Smith' <smithpb2250(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Subject: RE: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-08-21 13:04:58
Message-ID: TYCPR01MB5870AC61339753C7E3F0F1B0F51EA@TYCPR01MB5870.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Peter,

Thank you for reviewing! The patch can be available in [1].

> Commit Message
>
> 1.
> This commit allows nodes with logical replication slots to be upgraded. While
> reading information from the old cluster, a list of logical replication slots is
> newly extracted. At the later part of upgrading, pg_upgrade revisits the list
> and restores slots by using the pg_create_logical_replication_slots() on the new
> clushter.
>
> ~
>
> 1a
> /is newly extracted/is fetched/

Fixed.

> 1b.
> /using the pg_create_logical_replication_slots()/executing
> pg_create_logical_replication_slots()/

Fixed.

> 1c.
> /clushter/cluster/

Fixed.

> 2.
> Note that it must be done after the final pg_resetwal command during the upgrade
> because pg_resetwal will remove WALs that are required by the slots. Due to the
> restriction, the timing of restoring replication slots is different from other
> objects.
>
> ~
>
> 2a.
> /it must/slot restoration/

You meant to say s/it must/slot restoration must/, right? Fixed.

> 2b.
> /the restriction/this restriction/
>
> ======
> doc/src/sgml/ref/pgupgrade.sgml
>
> 3.
> + <para>
> + <application>pg_upgrade</application> attempts to migrate logical
> + replication slots. This helps avoid the need for manually defining the
> + same replication slot on the new publisher.
> + </para>
>
> /same replication slot/same replication slots/

Fixed.

> 4.
> + <para>
> + Before you start upgrading the publisher cluster, ensure that the
> + subscription is temporarily disabled, by executing
> + <link linkend="sql-altersubscription"><command>ALTER
> SUBSCRIPTION ... DISABLE</command></link>.
> + After the upgrade is complete, execute the
> + <command>ALTER SUBSCRIPTION ... CONNECTION</command>
> command to update the
> + connection string, and then re-enable the subscription.
> + </para>
>
> On the rendered page, it looks a bit strange that DISABLE has a link
> but COMMENTION does not have a link.

Added.

> 5.
> + <para>
> + There are some prerequisites for
> <application>pg_upgrade</application> to
> + be able to upgrade the replication slots. If these are not met an error
> + will be reported.
> + </para>
> +
> + <itemizedlist>
>
> +1 to use all the itemizedlist changes that Amit suggested [1] in his
> attachment.

Yeah, I agreed it is nice. Applied.

> src/bin/pg_upgrade/check.c
>
> 6.
> +static void check_for_logical_replication_slots(ClusterInfo *new_cluster);
>
> IMO the arg name should not shadow a global with the same name. See
> other review comment for this function signature.

OK, fixed.

> 7.
> + /* Extract a list of logical replication slots */
> + get_logical_slot_infos(&old_cluster, live_check);
>
> But 'live_check' is never used?

It is needed for 0003, moved.

> 8. check_for_logical_replication_slots
> +
> +/*
> + * Verify the parameter settings necessary for creating logical replication
> + * slots.
> + */
> +static void
> +check_for_logical_replication_slots(ClusterInfo *new_cluster)
>
> IMO the arg name should not shadow a global with the same name. If
> this is never going to be called with any param other than
> &new_cluster then probably it is better not even to pass have that
> argument at all. Just refer to the global new_cluster inside the
> function.
>
> You can't say that 'check_for_new_tablespace_dir' does it already so
> it must be OK -- I think that the existing function has the same issue
> and it also ought to be fixed to avoid shadowing!

Fixed.

> 9. check_for_logical_replication_slots
>
> + /* logical replication slots can be migrated since PG17. */
> + if (GET_MAJOR_VERSION(new_cluster->major_version) <= 1600)
> + return;
>
> IMO the code matches the comment better if you say < 1700 instead of <= 1600.

Changed.

> src/bin/pg_upgrade/function.c
>
> 10. get_loadable_libraries
> /*
> - * Fetch all libraries containing non-built-in C functions in this DB.
> + * Fetch all libraries containing non-built-in C functions or referred
> + * by logical replication slots in this DB.
> */
> ress[dbnum] = executeQueryOrDie(conn,
> ~
>
> /referred by/referred to by/

Fixed.

> src/bin/pg_upgrade/info.c
>
> 11.
> +/*
> + * get_logical_slot_infos()
> + *
> + * Higher level routine to generate LogicalSlotInfoArr for all databases.
> + */
> +void
> +get_logical_slot_infos(ClusterInfo *cluster, bool live_check)
> +{
> + int dbnum;
> + int slot_count = 0;
> +
> + if (cluster == &old_cluster)
> + pg_log(PG_VERBOSE, "\nsource databases:");
> + else
> + pg_log(PG_VERBOSE, "\ntarget databases:");
> +
> + for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
> + {
> + DbInfo *pDbInfo = &cluster->dbarr.dbs[dbnum];
> +
> + get_logical_slot_infos_per_db(cluster, pDbInfo);
> + slot_count += pDbInfo->slot_arr.nslots;
> +
> + if (log_opts.verbose)
> + {
> + pg_log(PG_VERBOSE, "Database: \"%s\"", pDbInfo->db_name);
> + print_slot_infos(&pDbInfo->slot_arr);
> + }
> + }
> +}
> +
>
> 11a.
> Now the variable 'slot_count' is no longer being returned so it seems redundant.
>
> ~
>
> 11b.
> What is the 'live_check' parameter for? Nobody is using it.

These are needed for 0003, moved.

> 12. count_logical_slots
>
> +int
> +count_logical_slots(ClusterInfo *cluster)
> +{
> + int dbnum;
> + int slotnum = 0;
> +
> + for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
> + slotnum += cluster->dbarr.dbs[dbnum].slot_arr.nslots;
> +
> + return slotnum;
> +}
>
> IMO this variable should be called something like 'slot_count'. This
> is the same review comment also made in a previous review. (See [2]
> comment#12).

Changed.

> 13. print_slot_infos
>
> +
> +static void
> +print_slot_infos(LogicalSlotInfoArr *slot_arr)
> +{
> + int slotnum;
> +
> + for (slotnum = 0; slotnum < slot_arr->nslots; slotnum++)
> + pg_log(PG_VERBOSE, "slotname: \"%s\", plugin: \"%s\", two_phase: %d",
> + slot_arr->slots[slotnum].slotname,
> + slot_arr->slots[slotnum].plugin,
> + slot_arr->slots[slotnum].two_phase);
> +}
>
> It might be nicer to introduce a variable, instead of all those array
> dereferences:
>
> LogicalSlotInfo *slot_info = &slot_arr->slots[slotnum];

Changed.

> 14.
> + for (slotnum = 0; slotnum < slot_arr->nslots; slotnum++)
> + {
> + /*
> + * Constructs query for creating logical replication slots.
> + *
> + * XXX: For simplification, pg_create_logical_replication_slot() is
> + * used. Is it sufficient?
> + */
> + appendPQExpBuffer(query, "SELECT
> pg_catalog.pg_create_logical_replication_slot(");
> + appendStringLiteralConn(query, slot_arr->slots[slotnum].slotname,
> + conn);
> + appendPQExpBuffer(query, ", ");
> + appendStringLiteralConn(query, slot_arr->slots[slotnum].plugin,
> + conn);
> + appendPQExpBuffer(query, ", false, %s);",
> + slot_arr->slots[slotnum].two_phase ? "true" : "false");
> +
> + PQclear(executeQueryOrDie(conn, "%s", query->data));
> +
> + resetPQExpBuffer(query);
> + }
> +
> + PQfinish(conn);
> +
> + destroyPQExpBuffer(query);
> + }
> +
> + end_progress_output();
> + check_ok();
>
> 14a
> Similar to the previous comment (#13). It might be nicer to introduce
> a variable, instead of all those array dereferences:
>
> LogicalSlotInfo *slot_info = &slot_arr->slots[slotnum];

Changed.

> 14b.
> It was not clear to me why this command is not being built using
> executeQueryOrDie directly instead of using the query buffer. Is there
> some reason?

I wanted to take care the encoding, that was the reason I used PQExpBuffer
functions, especially appendStringLiteralConn(). IIUC executeQueryOrDie() could
not take care it.

> 14c.
> I think it would be cleaner to have a separate res variable like you
> used elsewhere:
> res = executeQueryOrDie(...)
>
> instead of doing PQclear(executeQueryOrDie(conn, "%s", query->data));
> in one line

Hmm, there are some use cases for PQclear(executeQueryOrDie(...)) style, e.g.,
set_locale_and_encoding() and set_frozenxids(). I do not think your style is good
if the result of the query is not used. please tell me if you find a case that
res = executeQueryOrDie(...) is used but result is not checked.

> src/bin/pg_upgrade/pg_upgrade.
>
> 15.
> +void get_logical_slot_infos(ClusterInfo *cluster, bool live_check);
>
> I didn't see a reason for that 'live_check' parameter.

It was needed for 0003, moved.

> .../pg_upgrade/t/003_logical_replication_slots.pl
>
> 16.
> IMO this would be much easier to read if there were BIG comments
> between the actual TEST parts
>
> For example
>
> # ------------------------------
> # TEST: Confirm pg_upgrade fails is new node wal_level is not 'logical'
> <preparation>
> <test>
> <cleanup>
>
> # ------------------------------
> # TEST: Confirm pg_upgrade fails max_replication_slots on new node is too low
> <preparation>
> <test>
> <cleanup>
>
> # ------------------------------
> # TEST: Successful upgrade
> <preparation>
> <test>
> <cleanup>

Added. 0003 also followed the style.

> 17.
> +# Cause a failure at the start of pg_upgrade because wal_level is replica
> +command_fails(
> + [
> + 'pg_upgrade', '--no-sync',
> + '-d', $old_publisher->data_dir,
> + '-D', $new_publisher->data_dir,
> + '-b', $bindir,
> + '-B', $bindir,
> + '-s', $new_publisher->host,
> + '-p', $old_publisher->port,
> + '-P', $new_publisher->port,
> + $mode,
> + ],
> + 'run of pg_upgrade of old node with wrong wal_level');
> +ok( -d $new_publisher->data_dir . "/pg_upgrade_output.d",
> + "pg_upgrade_output.d/ not removed after pg_upgrade failure");
>
> The message is ambiguous
>
> BEFORE
> 'run of pg_upgrade of old node with wrong wal_level'
>
> SUGGESTION
> 'run of pg_upgrade where the new node has the wrong wal_level'

Changed.

> 18.
> +# Create an unnecessary slot on old node
> +$old_publisher->start;
> +$old_publisher->safe_psql(
> + 'postgres', qq[
> + SELECT pg_create_logical_replication_slot('test_slot2',
> 'test_decoding', false, true);
> +]);
> +
> +$old_publisher->stop;
> +
> +# Preparations for the subsequent test. max_replication_slots is set to
> +# smaller than existing slots on old node
> +$new_publisher->append_conf('postgresql.conf', "wal_level = 'logical'");
> +$new_publisher->append_conf('postgresql.conf', "max_replication_slots = 1");
>
>
> IMO the comment is misleading. It is not an "unnecessary slot", it is
> just a 2nd slot. And this is all part of the preparation for the next
> test so it should be under the other comment.
>
> For example SUGGESTION changes like this:
>
> # Preparations for the subsequent test.
> # 1. Create an unnecessary slot on the old node
> $old_publisher->start;
> $old_publisher->safe_psql(
> 'postgres', qq[
> SELECT pg_create_logical_replication_slot('test_slot2',
> 'test_decoding', false, true);
> ]);
> $old_publisher->stop;
> # 2. max_replication_slots is set to smaller than the number of slots
> (2) present on the old node
> $new_publisher->append_conf('postgresql.conf', "max_replication_slots = 1");
> # 3. new node wal_level is set correctly
> $new_publisher->append_conf('postgresql.conf', "wal_level = 'logical'");

Followed the style.

> 19.
> +# Remove an unnecessary slot and consume WAL records
> +$old_publisher->start;
> +$old_publisher->safe_psql(
> + 'postgres', qq[
> + SELECT pg_drop_replication_slot('test_slot2');
> + SELECT count(*) FROM pg_logical_slot_get_changes('test_slot1', NULL,
> NULL)
> +]);
> +$old_publisher->stop;
> +
>
> This comment should say more like:
>
> # Preparations for the subsequent test.

Followed above style.

> 20.
> +# Actual run, pg_upgrade_output.d is removed at the end
>
> This comment should mention that "successful upgrade is expected"
> because all the other prerequisites are now satisfied.

The suggestion was added to the comment

> 21.
> +$new_publisher->start;
> +my $result = $new_publisher->safe_psql('postgres',
> + "SELECT slot_name, two_phase FROM pg_replication_slots");
> +is($result, qq(test_slot1|t), 'check the slot exists on new node');
>
> Should there be a matching new_pulisher->stop;?

Not sure it is really needed, but added.
Also, the word "node" was replaced to "cluster" because the later word is used
in the doc.

[1]: https://www.postgresql.org/message-id/TYCPR01MB5870B5C0FE0C61CD04CBD719F51EA%40TYCPR01MB5870.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2023-08-21 13:05:57 RE: [PoC] pg_upgrade: allow to upgrade publisher node
Previous Message Hayato Kuroda (Fujitsu) 2023-08-21 13:04:06 RE: [PoC] pg_upgrade: allow to upgrade publisher node