| From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
|---|---|
| To: | shveta malik <shveta(dot)malik(at)gmail(dot)com> |
| Cc: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: pg_upgrade: optimize replication slot caught-up check |
| Date: | 2026-01-22 20:33:28 |
| Message-ID: | CAD21AoBQ7aRwfiizn4FFm9d9Lf9uEmn7-66L4eKsCAd17Gy=7g@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Jan 21, 2026 at 7:49 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Thu, Jan 22, 2026 at 5:19 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Jan 19, 2026 at 10:38 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Jan 14, 2026 at 11:24 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > I've attached the updated patch.
> > > >
> > >
> > > Thank You for the patch. I like the idea of optimization. Few initial comments:
> >
> > Thank you for reviewing the patch!
> >
> > >
> > > 1)
> > > + * The query returns the slot names and their caught-up status in
> > > + * the same order as the results collected by
> > > + * get_old_cluster_logical_slot_infos(). If this query is changed,
> > >
> > > I could not find the function get_old_cluster_logical_slot_infos(), do
> > > you mean get_old_cluster_logical_slot_infos_query()?
> >
> > It seems an oversight in commit 6d3d2e8e541f0. I think it should be
> > get_db_rel_and_slot_infos().
> >
> > >
> > > 2)
> > > " WHERE database = current_database() AND "
> > > " slot_type = 'logical' AND "
> > >
> > > Is there a reason why database = current_database() is placed before
> > > slot_type = 'logical'? I am not sure how the PostgreSQL optimizer and
> > > executor will order these predicates, but from the first look,
> > > slot_type = 'logical' appears cheaper and could be placed first,
> > > consistent with the ordering used at other places.
> >
> > Changed.
> >
> > >
> > > 3)
> > > Shouldn’t we add a sanity check inside
> > > get_old_cluster_logical_slot_infos_query() to ensure that when
> > > skip_caught_up_check is true, we are on PostgreSQL 18 or lower? This
> > > would make the function safer for future use if it's called elsewhere.
> > > I understand the caller already performs a similar check, but I think
> > > it's more appropriate here since we call
> > > binary_upgrade_logical_slot_has_caught_up() from inside, which doesn’t
> > > even exist on newer versions.
> >
> > What kind of sanity check did you mean? We can have a check with
> > pg_fatal() but it seems almost the same to me even if pg_upgrade fails
> > with an error due to missing
> > binary_upgrade_logical_slot_has_caught_up().
>
> I was referring to a development-level sanity check, something like:
>
> /* skip_caught_up_check is required iff PG19 or newer */
> Assert((GET_MAJOR_VERSION(cluster->major_version) >= 1900) ==
> skip_caught_up_check);
>
> But performing this check requires access to the cluster version (or
> cluster information), which this function currently does not have.
> Given that, do you think it would make sense to pass the cluster as an
> argument to this function in order to perform the sanity check here?
Hmm, I think it's better not to have the same check in multiple
places, but it might make sense to have
get_old_cluster_logical_slot_infos_query() decide whether to use the
fast method. I've updated the patch accordingly, please review it.
>
> > >
> > > 4)
> > > +# Check the file content. While both test_slot1 and test_slot2 should
> > > be reporting
> > > +# that they have unconsumed WAL records, test_slot3 should not be reported as
> > > +# it has caught up.
> > >
> > > Can you please elaborate the reason behind test_slot3 not being
> > > reported? Also mention in the comment if possible.
> >
> > We advance test_slot3 to the current WAL LSN before executing
> > pg_upgrade, so the test_slot3 should have consumed all pending WALs.
> > Please refer to the following changes:
>
> I understand the test, and the comments are clear to me. I also
> understand that only test_slot3 is expected to be in the caught-up
> state. My questions were specifically about the following points:
> 1) Why do we expect 'slot3 caught-up' not to be mentioned in the LOG?
> Is it simply because there is no corresponding logging in the code, or
> is this behavior related to some aspect of your fix that I may have
> missed?
>
> 2) In general, we do not validate the absence of LOG messages in
> tests. Why is this considered a special case where such a check is
> appropriate?
What LOG do you refer to? In these tests, we check the
invalid_logical_slots.txt file where pg_upgrade reports only invalid
slots (in terms of pg_upgrade). For test_slot3, it should not be
mentioned in that file as it has caught up. Given that the file has
only invalid slots, checking the absence of test_slot3 in the file
makes sense to me.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
| Attachment | Content-Type | Size |
|---|---|---|
| v7-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch | application/x-patch | 20.0 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Zsolt Parragi | 2026-01-22 21:10:30 | Re: Extensible storage manager API - SMGR hook Redux |
| Previous Message | Pierre Ducroquet | 2026-01-22 20:27:29 | Re: [PATCH] llvmjit: always add the simplifycfg pass |