Re: pg_upgrade: optimize replication slot caught-up check

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: pg_upgrade: optimize replication slot caught-up check
Date: 2026-01-22 20:33:28
Message-ID: CAD21AoBQ7aRwfiizn4FFm9d9Lf9uEmn7-66L4eKsCAd17Gy=7g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 21, 2026 at 7:49 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Thu, Jan 22, 2026 at 5:19 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 19, 2026 at 10:38 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Jan 14, 2026 at 11:24 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > I've attached the updated patch.
> > > >
> > >
> > > Thank You for the patch. I like the idea of optimization. Few initial comments:
> >
> > Thank you for reviewing the patch!
> >
> > >
> > > 1)
> > > + * The query returns the slot names and their caught-up status in
> > > + * the same order as the results collected by
> > > + * get_old_cluster_logical_slot_infos(). If this query is changed,
> > >
> > > I could not find the function get_old_cluster_logical_slot_infos(), do
> > > you mean get_old_cluster_logical_slot_infos_query()?
> >
> > It seems an oversight in commit 6d3d2e8e541f0. I think it should be
> > get_db_rel_and_slot_infos().
> >
> > >
> > > 2)
> > > " WHERE database = current_database() AND "
> > > " slot_type = 'logical' AND "
> > >
> > > Is there a reason why database = current_database() is placed before
> > > slot_type = 'logical'? I am not sure how the PostgreSQL optimizer and
> > > executor will order these predicates, but from the first look,
> > > slot_type = 'logical' appears cheaper and could be placed first,
> > > consistent with the ordering used at other places.
> >
> > Changed.
> >
> > >
> > > 3)
> > > Shouldn’t we add a sanity check inside
> > > get_old_cluster_logical_slot_infos_query() to ensure that when
> > > skip_caught_up_check is true, we are on PostgreSQL 18 or lower? This
> > > would make the function safer for future use if it's called elsewhere.
> > > I understand the caller already performs a similar check, but I think
> > > it's more appropriate here since we call
> > > binary_upgrade_logical_slot_has_caught_up() from inside, which doesn’t
> > > even exist on newer versions.
> >
> > What kind of sanity check did you mean? We can have a check with
> > pg_fatal() but it seems almost the same to me even if pg_upgrade fails
> > with an error due to missing
> > binary_upgrade_logical_slot_has_caught_up().
>
> I was referring to a development-level sanity check, something like:
>
> /* skip_caught_up_check is required iff PG19 or newer */
> Assert((GET_MAJOR_VERSION(cluster->major_version) >= 1900) ==
> skip_caught_up_check);
>
> But performing this check requires access to the cluster version (or
> cluster information), which this function currently does not have.
> Given that, do you think it would make sense to pass the cluster as an
> argument to this function in order to perform the sanity check here?

Hmm, I think it's better not to have the same check in multiple
places, but it might make sense to have
get_old_cluster_logical_slot_infos_query() decide whether to use the
fast method. I've updated the patch accordingly, please review it.

>
> > >
> > > 4)
> > > +# Check the file content. While both test_slot1 and test_slot2 should
> > > be reporting
> > > +# that they have unconsumed WAL records, test_slot3 should not be reported as
> > > +# it has caught up.
> > >
> > > Can you please elaborate the reason behind test_slot3 not being
> > > reported? Also mention in the comment if possible.
> >
> > We advance test_slot3 to the current WAL LSN before executing
> > pg_upgrade, so the test_slot3 should have consumed all pending WALs.
> > Please refer to the following changes:
>
> I understand the test, and the comments are clear to me. I also
> understand that only test_slot3 is expected to be in the caught-up
> state. My questions were specifically about the following points:
> 1) Why do we expect 'slot3 caught-up' not to be mentioned in the LOG?
> Is it simply because there is no corresponding logging in the code, or
> is this behavior related to some aspect of your fix that I may have
> missed?
>
> 2) In general, we do not validate the absence of LOG messages in
> tests. Why is this considered a special case where such a check is
> appropriate?

What LOG do you refer to? In these tests, we check the
invalid_logical_slots.txt file where pg_upgrade reports only invalid
slots (in terms of pg_upgrade). For test_slot3, it should not be
mentioned in that file as it has caught up. Given that the file has
only invalid slots, checking the absence of test_slot3 in the file
makes sense to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v7-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch application/x-patch 20.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zsolt Parragi 2026-01-22 21:10:30 Re: Extensible storage manager API - SMGR hook Redux
Previous Message Pierre Ducroquet 2026-01-22 20:27:29 Re: [PATCH] llvmjit: always add the simplifycfg pass