Re: pg_upgrade: optimize replication slot caught-up check

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: pg_upgrade: optimize replication slot caught-up check
Date: 2026-01-22 03:49:37
Message-ID: CAJpy0uAbWuJfqOzPHSFdieem6fVWErtGoj2pzRyTVbcT14PT+Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 22, 2026 at 5:19 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Mon, Jan 19, 2026 at 10:38 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Wed, Jan 14, 2026 at 11:24 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > I've attached the updated patch.
> > >
> >
> > Thank You for the patch. I like the idea of optimization. Few initial comments:
>
> Thank you for reviewing the patch!
>
> >
> > 1)
> > + * The query returns the slot names and their caught-up status in
> > + * the same order as the results collected by
> > + * get_old_cluster_logical_slot_infos(). If this query is changed,
> >
> > I could not find the function get_old_cluster_logical_slot_infos(), do
> > you mean get_old_cluster_logical_slot_infos_query()?
>
> It seems an oversight in commit 6d3d2e8e541f0. I think it should be
> get_db_rel_and_slot_infos().
>
> >
> > 2)
> > " WHERE database = current_database() AND "
> > " slot_type = 'logical' AND "
> >
> > Is there a reason why database = current_database() is placed before
> > slot_type = 'logical'? I am not sure how the PostgreSQL optimizer and
> > executor will order these predicates, but from the first look,
> > slot_type = 'logical' appears cheaper and could be placed first,
> > consistent with the ordering used at other places.
>
> Changed.
>
> >
> > 3)
> > Shouldn’t we add a sanity check inside
> > get_old_cluster_logical_slot_infos_query() to ensure that when
> > skip_caught_up_check is true, we are on PostgreSQL 18 or lower? This
> > would make the function safer for future use if it's called elsewhere.
> > I understand the caller already performs a similar check, but I think
> > it's more appropriate here since we call
> > binary_upgrade_logical_slot_has_caught_up() from inside, which doesn’t
> > even exist on newer versions.
>
> What kind of sanity check did you mean? We can have a check with
> pg_fatal() but it seems almost the same to me even if pg_upgrade fails
> with an error due to missing
> binary_upgrade_logical_slot_has_caught_up().

I was referring to a development-level sanity check, something like:

/* skip_caught_up_check is required iff PG19 or newer */
Assert((GET_MAJOR_VERSION(cluster->major_version) >= 1900) ==
skip_caught_up_check);

But performing this check requires access to the cluster version (or
cluster information), which this function currently does not have.
Given that, do you think it would make sense to pass the cluster as an
argument to this function in order to perform the sanity check here?

> >
> > 4)
> > +# Check the file content. While both test_slot1 and test_slot2 should
> > be reporting
> > +# that they have unconsumed WAL records, test_slot3 should not be reported as
> > +# it has caught up.
> >
> > Can you please elaborate the reason behind test_slot3 not being
> > reported? Also mention in the comment if possible.
>
> We advance test_slot3 to the current WAL LSN before executing
> pg_upgrade, so the test_slot3 should have consumed all pending WALs.
> Please refer to the following changes:

I understand the test, and the comments are clear to me. I also
understand that only test_slot3 is expected to be in the caught-up
state. My questions were specifically about the following points:
1) Why do we expect 'slot3 caught-up' not to be mentioned in the LOG?
Is it simply because there is no corresponding logging in the code, or
is this behavior related to some aspect of your fix that I may have
missed?
2) In general, we do not validate the absence of LOG messages in
tests. Why is this considered a special case where such a check is
appropriate?

> # Preparations for the subsequent test:
> -# 1. Generate extra WAL records. At this point neither test_slot1 nor
> -# test_slot2 has consumed them.
> +# 1. Generate extra WAL records. At this point none of slots has consumed them.
> #
> # 2. Advance the slot test_slot2 up to the current WAL location, but test_slot1
> # still has unconsumed WAL records.
> #
> # 3. Emit a non-transactional message. This will cause test_slot2 to detect the
> # unconsumed WAL record.
> +#
> +# 4. Advance the slot test_slots3 up to the current WAL location.
> $oldpub->start;
> $oldpub->safe_psql(
> 'postgres', qq[
> CREATE TABLE tbl AS SELECT generate_series(1, 10) AS a;
> SELECT pg_replication_slot_advance('test_slot2', pg_current_wal_lsn());
> - SELECT count(*) FROM pg_logical_emit_message('false',
> 'prefix', 'This is a non-transactional message');
> + SELECT count(*) FROM pg_logical_emit_message('false',
> 'prefix', 'This is a non-transactional message', true);
> + SELECT pg_replication_slot_advance('test_slot3', pg_current_wal_lsn());
>
> I believe that the following new comment explains the reason well:
>
> +# Check the file content. While both test_slot1 and test_slot2 should
> be reporting
> +# that they have unconsumed WAL records, test_slot3 should not be reported as
> +# it has caught up.
>
> I've attached the updated patch.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Vaibhav Dalvi 2026-01-22 04:13:32 Re: finish TODOs in to_json_is_immutable, to_jsonb_is_immutable also add tests on it
Previous Message Zhijie Hou (Fujitsu) 2026-01-22 03:38:38 RE: Newly created replication slot may be invalidated by checkpoint