RE: [PoC] pg_upgrade: allow to upgrade publisher node

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Peter Smith' <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Subject: RE: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-08-25 02:11:37
Message-ID: TYAPR01MB5866A112A644DDF3D2E0D3E4F5E3A@TYAPR01MB5866.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Peter,

Thanks for reviewing! New patch could be available in [1].

> 1. check_for_lost_slots
>
> +
> +/*
> + * Verify that all logical replication slots are usable.
> + */
> +void
> +check_for_lost_slots(ClusterInfo *cluster)
>
> 1a.
> AFAIK we don't ever need to call this also for 'new_cluster'. So the
> function should have no parameter and just access 'old_cluster'
> directly.

Actually I have asked in previous post, and I understood you like the style.
Fixed. Also, get_logical_slot_infos() and count_logical_slots() are also called
only for old_cluster, then removed the argument.

> 1b.
> Can't this be a static function now?

Yeah, changed to static.

> 2.
> + for (i = 0; i < ntups; i++)
> + pg_log(PG_WARNING,
> + "\nWARNING: logical replication slot \"%s\" is in 'lost' state.",
> + PQgetvalue(res, i, i_slotname));
>
> Is it correct that this message also includes the word "WARNING"?
> Other PG_WARNING messages don't do that.

create_script_for_old_cluster_deletion() has the word and I followed that:

```
pg_log(PG_WARNING,
"\nWARNING: new data directory should not be inside the old data directory, i.e. %s", old_cluster_pgdata);
```

> 3. check_for_confirmed_flush_lsn
>
> +/*
> + * Verify that all logical replication slots consumed all WALs, except a
> + * CHECKPOINT_SHUTDOWN record.
> + */
> +static void
> +check_for_confirmed_flush_lsn(ClusterInfo *cluster)
>
> AFAIK we don't ever need to call this also for 'new_cluster'. So the
> function should have no parameter and just access 'old_cluster'
> directly.

Removed.

> 4.
> + for (i = 0; i < ntups; i++)
> + pg_log(PG_WARNING,
> + "\nWARNING: logical replication slot \"%s\" has not consumed WALs yet",
> + PQgetvalue(res, i, i_slotname));
>
> Is it correct that this message also includes the word "WARNING"?
> Other PG_WARNING messages don't do that.

See above reply, create_script_for_old_cluster_deletion() has that.

> src/bin/pg_upgrade/controldata.c
>
> 5. get_control_data
>
> + else if ((p = strstr(bufin, "Latest checkpoint location:")) != NULL)
> + {
> + /*
> + * Gather the latest checkpoint location if the cluster is PG17
> + * or later. This is used for upgrading logical replication
> + * slots.
> + */
> + if (GET_MAJOR_VERSION(cluster->major_version) >= 1700)
>
> But we are not "gathering" anything. It's just one LSN. I think this
> ought to just say "Read the latest..."

Changed.

> 6.
> + /*
> + * The upper and lower part of LSN must be read separately
> + * because it is reported in %X/%X format.
> + */
>
> /reported/stored as/

Changed.

> src/bin/pg_upgrade/pg_upgrade.h
>
> 7.
> +void check_for_lost_slots(ClusterInfo *cluster);\
>
> Why is this needed here? Can't this be a static function?

Removed.

> .../t/003_logical_replication_slots.pl
>
> 8.
> +# 2. Consume WAL records to avoid another type of upgrade failure. It will be
> +# tested in subsequent cases.
> +$old_publisher->safe_psql('postgres',
> + "SELECT count(*) FROM pg_logical_slot_get_changes('test_slot1', NULL,
> NULL);"
> +);
>
> I wondered if that step really needed. Why will there be WAL records to consume?
>
> IIUC we haven't published anything yet.

The primal reason was described in [2], the reply for comment 10.
After creating 'test_slot1', another 'test_slot2' is also created, and the
function generates the RUNNING_XLOG record. The backtrace is as follows:

pg_create_logical_replication_slot
create_logical_replication_slot
CreateInitDecodingContext
ReplicationSlotReserveWal
LogStandbySnapshot
LogCurrentRunningXacts
XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);

check_for_confirmed_flush_lsn() detects the record and raises FATAL error before
checking GUC on new cluster.

> 9.
> +# ------------------------------
> +# TEST: Successful upgrade
> +
> +# Preparations for the subsequent test:
> +# 1. Remove the remained slot
> +$old_publisher->start;
> +$old_publisher->safe_psql('postgres',
> + "SELECT * FROM pg_drop_replication_slot('test_slot1');"
> +);
>
> Should removal of the slot be done as part of the cleanup of the
> previous test, instead of preparing for this one?

Moved to cleanup part.

> 10.
> # 3. Disable the subscription once
> $subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION sub DISABLE");
> $old_publisher->stop;
>
> 10a.
> What do you mean by "once"?

I added the word because the subscription would be enabled again.
But after considering more, I thought "Temporarily" seems better. Fixed.

> 10b.
> That old_publisher->stop; seems strangely placed. Why is it here?

We must shut down the cluster before doing pg_upgrade. Isn't it same as line 124?

```
# 2. Generate extra WAL records. Because these WAL records do not get consumed
# it will cause the upcoming pg_upgrade test to fail.
$old_publisher->safe_psql('postgres',
"CREATE TABLE tbl AS SELECT generate_series(1, 10) AS a;"
);
$old_publisher->stop;
```

> 11.
> # Check that the slot 'test_slot1' has migrated to the new cluster
> $new_publisher->start;
> my $result = $new_publisher->safe_psql('postgres',
> "SELECT slot_name, two_phase FROM pg_replication_slots");
> is($result, qq(sub|t), 'check the slot exists on new cluster');
>
> ~
>
> That comment now seems wrong. That slot was previously removed, right?

Yeah, it should be 'sub'. Changed.

> 12.
> # Update the connection
> my $new_connstr = $new_publisher->connstr . ' dbname=postgres';
> $subscriber->safe_psql('postgres',
> "ALTER SUBSCRIPTION sub CONNECTION '$new_connstr'");
> $subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION sub ENABLE");
>
> ~
>
> Maybe better to combine both SQL.

Combined.

[1]: https://www.postgresql.org/message-id/TYAPR01MB5866D7677BAE6F66839570FCF5E3A%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2]: https://www.postgresql.org/message-id/TYAPR01MB58668021BB233D129B466122F51CA%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tristan Partin 2023-08-25 02:19:22 Re: psql --no-connect option
Previous Message Hayato Kuroda (Fujitsu) 2023-08-25 02:10:12 RE: [PoC] pg_upgrade: allow to upgrade publisher node