RE: [PoC] pg_upgrade: allow to upgrade publisher node

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: RE: [PoC] pg_upgrade: allow to upgrade publisher node
Date: 2023-08-02 08:13:38
Message-ID: TYAPR01MB5866FD3F7992A46D0457F0E6F50BA@TYAPR01MB5866.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Amit,

Thank you for giving comments! PSA new version patchset.

> 1. Do we really need 0001 patch after the latest change proposed by
> Vignesh in the 0004 patch?

I removed 0001 patch and revived old patch which serializes slots at shutdown.
This is because the problem which slots are not serialized to disk still remain [1]
and then confirmed_flush becomes behind, even if we implement the approach.

> 2.
> + if (dopt.logical_slots_only)
> + {
> + if (!dopt.binary_upgrade)
> + pg_fatal("options --logical-replication-slots-only requires option
> --binary-upgrade");
> +
> + if (dopt.dataOnly)
> + pg_fatal("options --logical-replication-slots-only and
> -a/--data-only cannot be used together");
> +
> + if (dopt.schemaOnly)
> + pg_fatal("options --logical-replication-slots-only and
> -s/--schema-only cannot be used together");
>
> Can you please explain why the patch imposes these restrictions? I
> guess the binary_upgrade is because you want this option to be used
> for the upgrade. Do we want to avoid giving any other option with
> logical_slots, if so, are the above checks sufficient and why?

Regarding the --binary-upgrade, the motivation is same as you expected. I covered
up the --logical-replication-slots-only option from users, so it should not be
used not for upgrade. Additionaly, this option is not shown in help and document.

As for -{data|schema}-only options, I removed restrictions.
Firstly I set as excluded because it may be confused - as discussed at [2], slots
must be dumped after all the pg_resetwal is done and at that time all the definitions
are already dumped. to avoid duplicated definitions, we must ensure only slots are
written in the output file. I thought this requirement contradict descirptions of
these options (Dump only the A, not B).
But after considering more, I thought this might not be needed because it was not
opened to users - no one would be confused by using both them.
(Restriction for -c is also removed for the same motivation)

> 3.
> + /*
> + * Get replication slots.
> + *
> + * XXX: Which information must be extracted from old node? Currently three
> + * attributes are extracted because they are used by
> + * pg_create_logical_replication_slot().
> + */
> + appendPQExpBufferStr(query,
> + "SELECT slot_name, plugin, two_phase "
> + "FROM pg_catalog.pg_replication_slots "
> + "WHERE database = current_database() AND temporary = false "
> + "AND wal_status IN ('reserved', 'extended');");
>
> Why are we ignoring the slots that have wal status as WALAVAIL_REMOVED
> or WALAVAIL_UNRESERVED? I think the slots where wal status is
> WALAVAIL_REMOVED, the corresponding slots are invalidated at some
> point. I think such slots can't be used for decoding but these will be
> dropped along with the subscription or when a user does it manually.
> So, if we don't copy such slots after the upgrade then there could be
> a problem in dropping the corresponding subscription. If we don't want
> to copy over such slots then we need to provide instructions on what
> users should do in such cases. OTOH, if we want to copy over such
> slots then we need to find a way to invalidate such slots after copy.
> Either way, this needs more analysis.

I considered again here. At least WALAVAIL_UNRESERVED should be supported because
the slot is still usable. It can return reserved or extended.

As for WALAVAIL_REMOVED, I don't think it should be so that I added a description
to the document.

This feature re-create slots which have same name/plugins as old ones, not replicate
its state. So if we copy them as-is slots become usable again. If subscribers refer
the slot and then connect again at that time, changes between 'WALAVAIL_REMOVED'
may be lost.

Based on above slots must be copied as WALAVAIL_REMOVED, but as you said, we do
not have a way to control that. the status is calculated by using restart_lsn,
but there are no function to modify directly.

One approach is adding an SQL funciton which set restart_lsn to aritrary value
(or 0/0, invalid), but it seems dangerous.

> 4.
> + /*
> + * Check that all logical replication slots have reached the current WAL
> + * position.
> + */
> + res = executeQueryOrDie(conn,
> + "SELECT slot_name FROM pg_catalog.pg_replication_slots "
> + "WHERE (SELECT count(record_type) "
> + " FROM pg_catalog.pg_get_wal_records_content(confirmed_flush_lsn,
> pg_catalog.pg_current_wal_insert_lsn()) "
> + " WHERE record_type != 'CHECKPOINT_SHUTDOWN') <> 0 "
> + "AND temporary = false AND wal_status IN ('reserved', 'extended');");
>
> I think this can unnecessarily lead to reading a lot of WAL data if
> the confirmed_flush_lsn for a slot is too much behind. Can we think of
> improving this by passing the number of records to read which in this
> case should be 1?

I checked and pg_wal_record_info() seemed to be used for the purpose. I tried to
move the functionality to core.

But this function raise an ERROR when there is no valid record after the specified
lsn. This means that the pg_upgrade fails if logical slots has caught up the current
WAL location. IIUC DBA must do following steps:

1. shutdown old publisher
2. disable the subscription once <- this is mandatory, otherwise the walsender may
send the record during the upgrade and confirmed_lsn may point the SHUTDOWN_CHECKPOINT
3. do pg_upgrade <- pg_get_wal_record_content() may raise an ERROR if 2. was skipped
4. change the connection string of subscription
5. enable the subscription again

If we think this is not robust, we must implement similar function which does not raise ERROR instead.
How do you think?

[1]: https://www.postgresql.org/message-id/20230414061248.vdsxz2febjo3re6h%40jrouhaud
[2]: https://www.postgresql.org/message-id/CAA4eK1KD-hZ3syruxJA6fK-JtSBzL6etkwToPuTmVkrCvT6ASw@mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment Content-Type Size
v18-0001-pg_upgrade-Add-include-logical-replication-slots.patch application/octet-stream 34.9 KB
v18-0002-Always-persist-to-disk-logical-slots-during-a-sh.patch application/octet-stream 4.8 KB
v18-0003-Move-pg_get_wal_record_info-functionality-from-p.patch application/octet-stream 21.4 KB
v18-0004-pg_upgrade-Add-check-function-for-include-logica.patch application/octet-stream 11.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2023-08-02 08:29:39 Re: cataloguing NOT NULL constraints
Previous Message Hayato Kuroda (Fujitsu) 2023-08-02 08:13:14 RE: [PoC] pg_upgrade: allow to upgrade publisher node