Re: [PoC] pg_upgrade: allow to upgrade publisher node

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Subject: Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-08-18 03:51:38
Message-ID: CAHut+PuZ5kUnD6uc=x-t=6_iUKfJ2m7Qvv3pPDh57Onsi6bfHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Here are some review comments for the patch v21-0003

======
Commit message

1.
pg_upgrade fails if the old node has slots which status is 'lost' or they do not
consume all WAL records. These are needed for prevent the data loss.

~

Maybe some minor brush-up like:

SUGGESTION
In order to prevent data loss, pg_upgrade will fail if the old node
has slots with the status 'lost', or with unconsumed WAL records.

======
src/bin/pg_upgrade/check.c

2. check_for_confirmed_flush_lsn

+ /* Check that all logical slots are not in 'lost' state. */
+ res = executeQueryOrDie(conn,
+ "SELECT slot_name FROM pg_catalog.pg_replication_slots "
+ "WHERE temporary = false AND wal_status = 'lost';");
+
+ ntups = PQntuples(res);
+ i_slotname = PQfnumber(res, "slot_name");
+
+ for (i = 0; i < ntups; i++)
+ {
+ is_error = true;
+
+ pg_log(PG_WARNING,
+ "\nWARNING: logical replication slot \"%s\" is obsolete.",
+ PQgetvalue(res, i, i_slotname));
+ }
+
+ PQclear(res);
+
+ if (is_error)
+ pg_fatal("logical replication slots not to be in 'lost' state.");
+

2a. (GENERAL)
The above code for checking lost state seems out of place in this
function which is meant for checking confirmed flush lsn.

Maybe you jammed both kinds of logic into one function to save on the
extra PGconn or something but IMO two separate functions would be
better. e.g.
- check_for_lost_slots
- check_for_confirmed_flush_lsn

~

2b.
+ /* Check that all logical slots are not in 'lost' state. */

SUGGESTION
/* Check there are no logical replication slots with a 'lost' state. */

~

2c.
+ res = executeQueryOrDie(conn,
+ "SELECT slot_name FROM pg_catalog.pg_replication_slots "
+ "WHERE temporary = false AND wal_status = 'lost';");

This SQL fragment is very much like others in previous patches. Be
sure to make all the cases and clauses consistent with all those
similar SQL fragments.

~

2d.
+ is_error = true;

That doesn't need to be in the loop. Better to just say:
is_error = (ntups > 0);

~

2e.
There is a mix of terms in the WARNING and in the pg_fatal -- e.g.
"obsolete" versus "lost". Is it OK?

~

2f.
+ pg_fatal("logical replication slots not to be in 'lost' state.");

English? And maybe it should be much more verbose...

"Upgrade of this installation is not allowed because one or more
logical replication slots with a state of 'lost' were detected."

~~~

3. check_for_confirmed_flush_lsn

+ /*
+ * Check that all logical replication slots have reached the latest
+ * checkpoint position (SHUTDOWN_CHECKPOINT record). This checks cannot be
+ * done in case of live_check because the server has not been written the
+ * SHUTDOWN_CHECKPOINT record yet.
+ */
+ if (!live_check)
+ {
+ res = executeQueryOrDie(conn,
+ "SELECT slot_name FROM pg_catalog.pg_replication_slots "
+ "WHERE confirmed_flush_lsn != '%X/%X' AND temporary = false;",
+ old_cluster.controldata.chkpnt_latest_upper,
+ old_cluster.controldata.chkpnt_latest_lower);
+
+ ntups = PQntuples(res);
+ i_slotname = PQfnumber(res, "slot_name");
+
+ for (i = 0; i < ntups; i++)
+ {
+ is_error = true;
+
+ pg_log(PG_WARNING,
+ "\nWARNING: logical replication slot \"%s\" has not consumed WALs yet",
+ PQgetvalue(res, i, i_slotname));
+ }
+
+ PQclear(res);
+ PQfinish(conn);
+
+ if (is_error)
+ pg_fatal("All logical replication slots consumed all the WALs.");

~

3a.
/This checks/This check/

~

3b.
I don't think the separation of
chkpnt_latest_upper/chkpnt_latest_lower is needed like this. AFAIK
there is an LSN_FORMAT_ARGS(lsn) macro designed for handling exactly
this kind of parameter substitution.

~

3c.
+ is_error = true;

That doesn't need to be in the loop. Better to just say:
is_error = (ntups > 0);

~

3d.
+ pg_fatal("All logical replication slots consumed all the WALs.");

The message seems backward. shouldn't it say something like:
"Upgrade of this installation is not allowed because one or more
logical replication slots still have unconsumed WAL records."

======
src/bin/pg_upgrade/controldata.c

4. get_control_data

+ /*
+ * Upper and lower part of LSN must be read and stored
+ * separately because it is reported as %X/%X format.
+ */
+ cluster->controldata.chkpnt_latest_upper =
+ strtoul(p, &slash, 16);
+ cluster->controldata.chkpnt_latest_lower =
+ strtoul(++slash, NULL, 16);

I felt that this field separation code is maybe not necessary. Please
refer to other review comments in this post.

======
src/bin/pg_upgrade/pg_upgrade.h

5. ControlData

+
+ uint32 chkpnt_latest_upper;
+ uint32 chkpnt_latest_lower;
} ControlData;

~

Actually, I did not recognise the reason why this cannot be stored
properly as a single XLogRecPtr field. Please see other review
comments in this post.

======
.../t/003_logical_replication_slots.pl

6. GENERAL

Many of the changes to this file are just renaming the
'old_node'/'new_node' to 'old_publisher'/'new_publisher'.

This seems a basic change not really associated with this patch 0003.
To reduce the code churn, this change should be moved into the earlier
patch where this test file (003_logical_replication_slots.pl) was
first introduced,

~~~

7.

# Cause a failure at the start of pg_upgrade because slot do not finish
# consuming all the WALs

~

Can you give a more detailed explanation in the comment of how this
test case achieves what it says?

======
src/test/regress/sql/misc_functions.sql

8.
@@ -236,4 +236,4 @@ SELECT * FROM pg_split_walfile_name('invalid');
SELECT segment_number > 0 AS ok_segment_number, timeline_id
FROM pg_split_walfile_name('000000010000000100000000');
SELECT segment_number > 0 AS ok_segment_number, timeline_id
- FROM pg_split_walfile_name('ffffffFF00000001000000af');
+ FROM pg_split_walfile_name('ffffffFF00000001000000af');
\ No newline at end of file

~

What is this change for? It looks like maybe some accidental
whitespace change happened.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2023-08-18 04:42:00 Re: New WAL record to detect the checkpoint redo location
Previous Message shveta malik 2023-08-18 03:49:10 Re: Adding a LogicalRepWorker type field