| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
| Cc: | shveta malik <shveta(dot)malik(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: pg_upgrade: optimize replication slot caught-up check |
| Date: | 2026-01-23 01:16:17 |
| Message-ID: | B098C626-EE7D-4E98-8685-FBB7980C795E@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
> On Jan 23, 2026, at 04:33, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, Jan 21, 2026 at 7:49 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>>
>> On Thu, Jan 22, 2026 at 5:19 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>
>>> On Mon, Jan 19, 2026 at 10:38 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>>>>
>>>> On Wed, Jan 14, 2026 at 11:24 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>>>
>>>>> I've attached the updated patch.
>>>>>
>>>>
>>>> Thank You for the patch. I like the idea of optimization. Few initial comments:
>>>
>>> Thank you for reviewing the patch!
>>>
>>>>
>>>> 1)
>>>> + * The query returns the slot names and their caught-up status in
>>>> + * the same order as the results collected by
>>>> + * get_old_cluster_logical_slot_infos(). If this query is changed,
>>>>
>>>> I could not find the function get_old_cluster_logical_slot_infos(), do
>>>> you mean get_old_cluster_logical_slot_infos_query()?
>>>
>>> It seems an oversight in commit 6d3d2e8e541f0. I think it should be
>>> get_db_rel_and_slot_infos().
>>>
>>>>
>>>> 2)
>>>> " WHERE database = current_database() AND "
>>>> " slot_type = 'logical' AND "
>>>>
>>>> Is there a reason why database = current_database() is placed before
>>>> slot_type = 'logical'? I am not sure how the PostgreSQL optimizer and
>>>> executor will order these predicates, but from the first look,
>>>> slot_type = 'logical' appears cheaper and could be placed first,
>>>> consistent with the ordering used at other places.
>>>
>>> Changed.
>>>
>>>>
>>>> 3)
>>>> Shouldn’t we add a sanity check inside
>>>> get_old_cluster_logical_slot_infos_query() to ensure that when
>>>> skip_caught_up_check is true, we are on PostgreSQL 18 or lower? This
>>>> would make the function safer for future use if it's called elsewhere.
>>>> I understand the caller already performs a similar check, but I think
>>>> it's more appropriate here since we call
>>>> binary_upgrade_logical_slot_has_caught_up() from inside, which doesn’t
>>>> even exist on newer versions.
>>>
>>> What kind of sanity check did you mean? We can have a check with
>>> pg_fatal() but it seems almost the same to me even if pg_upgrade fails
>>> with an error due to missing
>>> binary_upgrade_logical_slot_has_caught_up().
>>
>> I was referring to a development-level sanity check, something like:
>>
>> /* skip_caught_up_check is required iff PG19 or newer */
>> Assert((GET_MAJOR_VERSION(cluster->major_version) >= 1900) ==
>> skip_caught_up_check);
>>
>> But performing this check requires access to the cluster version (or
>> cluster information), which this function currently does not have.
>> Given that, do you think it would make sense to pass the cluster as an
>> argument to this function in order to perform the sanity check here?
>
> Hmm, I think it's better not to have the same check in multiple
> places, but it might make sense to have
> get_old_cluster_logical_slot_infos_query() decide whether to use the
> fast method. I've updated the patch accordingly, please review it.
>
>>
>>>>
>>>> 4)
>>>> +# Check the file content. While both test_slot1 and test_slot2 should
>>>> be reporting
>>>> +# that they have unconsumed WAL records, test_slot3 should not be reported as
>>>> +# it has caught up.
>>>>
>>>> Can you please elaborate the reason behind test_slot3 not being
>>>> reported? Also mention in the comment if possible.
>>>
>>> We advance test_slot3 to the current WAL LSN before executing
>>> pg_upgrade, so the test_slot3 should have consumed all pending WALs.
>>> Please refer to the following changes:
>>
>> I understand the test, and the comments are clear to me. I also
>> understand that only test_slot3 is expected to be in the caught-up
>> state. My questions were specifically about the following points:
>> 1) Why do we expect 'slot3 caught-up' not to be mentioned in the LOG?
>> Is it simply because there is no corresponding logging in the code, or
>> is this behavior related to some aspect of your fix that I may have
>> missed?
>>
>> 2) In general, we do not validate the absence of LOG messages in
>> tests. Why is this considered a special case where such a check is
>> appropriate?
>
> What LOG do you refer to? In these tests, we check the
> invalid_logical_slots.txt file where pg_upgrade reports only invalid
> slots (in terms of pg_upgrade). For test_slot3, it should not be
> mentioned in that file as it has caught up. Given that the file has
> only invalid slots, checking the absence of test_slot3 in the file
> makes sense to me.
>
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
> <v7-0001-pg_upgrade-Optimize-replication-slot-caught-up-ch.patch>
I just went through v7, and overall looks good to me.
Only nitpick is:
```
+ *
+ * use_fast_caught_up_check is set to true on return if available in the given
+ * cluster.
```
Strictly speaking, use_fast_caught_up_check is a pointer, it cannot be set, it is *use_fast_caught_up_check that is set to true. So, please add a star sign in front of use_fast_caught_up_check.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Andres Freund | 2026-01-23 01:18:21 | Re: More speedups for tuple deformation |
| Previous Message | Chao Li | 2026-01-23 00:54:33 | Re: tablecmds: reject CLUSTER ON for partitioned tables earlier |