Re: pg_upgrade: optimize replication slot caught-up check

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_upgrade: optimize replication slot caught-up check
Date: 2026-01-21 23:49:14
Message-ID: CAD21AoAqd8=7km_AENWtv2TddXfn3P2veiTmqdmeYh2bSC0owA@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 19, 2026 at 10:38 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Wed, Jan 14, 2026 at 11:24 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > I've attached the updated patch.
> >
>
> Thank You for the patch. I like the idea of optimization. Few initial comments:

Thank you for reviewing the patch!

>
> 1)
> + * The query returns the slot names and their caught-up status in
> + * the same order as the results collected by
> + * get_old_cluster_logical_slot_infos(). If this query is changed,
>
> I could not find the function get_old_cluster_logical_slot_infos(), do
> you mean get_old_cluster_logical_slot_infos_query()?

It seems an oversight in commit 6d3d2e8e541f0. I think it should be
get_db_rel_and_slot_infos().

>
> 2)
> " WHERE database = current_database() AND "
> " slot_type = 'logical' AND "
>
> Is there a reason why database = current_database() is placed before
> slot_type = 'logical'? I am not sure how the PostgreSQL optimizer and
> executor will order these predicates, but from the first look,
> slot_type = 'logical' appears cheaper and could be placed first,
> consistent with the ordering used at other places.

Changed.

>
> 3)
> Shouldn’t we add a sanity check inside
> get_old_cluster_logical_slot_infos_query() to ensure that when
> skip_caught_up_check is true, we are on PostgreSQL 18 or lower? This
> would make the function safer for future use if it's called elsewhere.
> I understand the caller already performs a similar check, but I think
> it's more appropriate here since we call
> binary_upgrade_logical_slot_has_caught_up() from inside, which doesn’t
> even exist on newer versions.

What kind of sanity check did you mean? We can have a check with
pg_fatal() but it seems almost the same to me even if pg_upgrade fails
with an error due to missing
binary_upgrade_logical_slot_has_caught_up().

>
> 4)
> +# Check the file content. While both test_slot1 and test_slot2 should
> be reporting
> +# that they have unconsumed WAL records, test_slot3 should not be reported as
> +# it has caught up.
>
> Can you please elaborate the reason behind test_slot3 not being
> reported? Also mention in the comment if possible.

We advance test_slot3 to the current WAL LSN before executing
pg_upgrade, so the test_slot3 should have consumed all pending WALs.
Please refer to the following changes:

# Preparations for the subsequent test:
-# 1. Generate extra WAL records. At this point neither test_slot1 nor
-# test_slot2 has consumed them.
+# 1. Generate extra WAL records. At this point none of slots has consumed them.
#
# 2. Advance the slot test_slot2 up to the current WAL location, but test_slot1
# still has unconsumed WAL records.
#
# 3. Emit a non-transactional message. This will cause test_slot2 to detect the
# unconsumed WAL record.
+#
+# 4. Advance the slot test_slots3 up to the current WAL location.
$oldpub->start;
$oldpub->safe_psql(
'postgres', qq[
CREATE TABLE tbl AS SELECT generate_series(1, 10) AS a;
SELECT pg_replication_slot_advance('test_slot2', pg_current_wal_lsn());
- SELECT count(*) FROM pg_logical_emit_message('false',
'prefix', 'This is a non-transactional message');
+ SELECT count(*) FROM pg_logical_emit_message('false',
'prefix', 'This is a non-transactional message', true);
+ SELECT pg_replication_slot_advance('test_slot3', pg_current_wal_lsn());

I believe that the following new comment explains the reason well:

+# Check the file content. While both test_slot1 and test_slot2 should
be reporting
+# that they have unconsumed WAL records, test_slot3 should not be reported as
+# it has caught up.

I've attached the updated patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v6-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch application/octet-stream 19.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2026-01-22 00:02:10 Re: Flush some statistics within running transactions
Previous Message Sami Imseih 2026-01-21 23:41:13 Re: Flush some statistics within running transactions