Re: [PoC] pg_upgrade: allow to upgrade publisher node

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-09-21 11:27:44
Message-ID: CALj2ACWb3NiO9bSH2r7_Hcd3DRoS3NBytkLTwYQ8RRh1Ky8dFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 20, 2023 at 7:20 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Good catch, I could not notice because it worked well in my RHEL. Here is the
> updated version.

Thanks for the patch. I have some comments on v42:

1.
+{ oid => '8046', descr => 'for use by pg_upgrade',
+ proname => 'binary_upgrade_validate_wal_records', proisstrict => 'f',
+ provolatile => 'v', proparallel => 'u', prorettype => 'bool',

+ if (PG_ARGISNULL(0))
+ elog(ERROR, "null argument to
binary_upgrade_validate_wal_records is not allowed");

Can proisstrict => 'f' be removed so that there's no need for explicit
PG_ARGISNULL check? Any specific reason to keep it?

And, the before the ISNULL check the arg is read, which isn't good.

2.
+Datum
+binary_upgrade_validate_wal_records(PG_FUNCTION_ARGS)

The function name looks too generic in the sense that it validates WAL
records for correctness/corruption, but it is not. Can it be something
like binary_upgrade_{check_for_wal_logical_end,
check_for_logical_end_of_wal} or such?

3.
+ /* Quick exit if the given lsn is larger than current one */
+ if (start_lsn >= GetFlushRecPtr(NULL))
+ PG_RETURN_BOOL(false);
+

An LSN that doesn't exists yet is an error IMO, may be an error better here?

4.
+ * This function is used to verify that there are no WAL records (except some
+ * types) after confirmed_flush_lsn of logical slots, which means all the
+ * changes were replicated to the subscriber. There is a possibility that some
+ * WALs are inserted during upgrade, so such types would be ignored.
+ *

This comment before the function better be at the callsite of the
function, because as far as this function is concerned, it checks if
there are any WAL records that are not "certain" types after the given
LSN, it doesn't know logical slots or confirmed_flush_lsn or such.

5. Trying to understand the interaction of this feature with custom
WAL records that a custom WAL resource manager puts in. Is it okay to
have custom WAL records after the "logical WAL end"?
+ /*
+ * There is a possibility that following records may be generated
+ * during the upgrade.
+ */

6.
+ if (PQntuples(res) != 1)
+ pg_fatal("could not count the number of logical replication slots");
+

Not existing a single logical replication slot an error? I think it
must be if (PQntuples(res) == 0) return;?

7. A nit:
+ nslots_on_new = atoi(PQgetvalue(res, 0, 0));
+
+ if (nslots_on_new)

Just do if(atoi(PQgetvalue(res, 0, 0)) > 0) and get rid of nslots_on_new?

8.
+ if (nslots_on_new)
+ pg_fatal("expected 0 logical replication slots but found %d",
+ nslots_on_new);

How about "New cluster database is containing logical replication
slots", note that the some of the fatal messages are starting with an
upper-case letter.

9.
+ res = executeQueryOrDie(conn, "SHOW wal_level;");
+ res = executeQueryOrDie(conn, "SHOW max_replication_slots;");

Instead of 2 queries to determine required parameters, isn't it better
with a single query like the following?

select setting from pg_settings where name in ('wal_level',
'max_replication_slots') order by name;

10.
Why just wal_level and max_replication_slots, why not
max_worker_processes and max_wal_senders too? I'm looking at
RecoveryRequiresIntParameter and if they are different on the upgraded
instance, chances that the logical replication won't work, no?

11.
+# 2. Generate extra WAL records. Because these WAL records do not get consumed
+# it will cause the upcoming pg_upgrade test to fail.
+$old_publisher->safe_psql('postgres',
+ "CREATE TABLE tbl AS SELECT generate_series(1, 10) AS a;"
+);
+$old_publisher->stop;

This might be a recipie for sporadic test failures - how is it
guaranteed that the newly generated WAL records aren't consumed.

May be stop subscriber or temporarily disable the subscription and
then generate WAL records?

12.
+extern XLogReaderState *InitXLogReaderState(XLogRecPtr lsn);
+extern XLogRecord *ReadNextXLogRecord(XLogReaderState *xlogreader);
+

Why not these functions be defined in xlogreader.h with elog/ereport
in #ifndef FRONTEND #endif blocks? IMO, xlogreader.h seems right
location for these functions.

13.
+LogicalReplicationSlotInfo

Where is this structure defined?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jakub Wartak 2023-09-21 12:10:09 Re: pg_stat_statements and "IN" conditions
Previous Message Etsuro Fujita 2023-09-21 11:07:56 Re: Comment about set_join_pathlist_hook()