Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date: 2025-12-01 06:45:34
Message-ID: CAD21AoB1mg4-o3r9jd7MZ=gRhN9PE=0kvS0whBVuMoxU5mGqYQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 27, 2025 at 4:59 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Nov 27, 2025 at 2:32 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > I've squashed all fixup patches and attached the updated patch.
> >
>
> 1.
> <literal>wal_level_insufficient</literal> means that the
> - primary doesn't have a <xref linkend="guc-wal-level"/> sufficient to
> - perform logical decoding. It is set only for logical slots.
> + primary doesn't have a <xref linkend="guc-effective-wal-level"/>
> + to perform logical decoding.
>
> sufficient is missing after "guc-effective-wal-level"
>
> 2.
> + * With 'minimal' WAL level, there are not logical replication slots
> + * during recovery.
>
> /not/no. Typo
>
> 3.
> case XLOG_LOGICAL_DECODING_STATUS_CHANGE:
> {
> - xl_parameter_change *xlrec =
> - (xl_parameter_change *) XLogRecGetData(buf->record);
> + bool logical_decoding;
>
> - /*
> - * If wal_level on the primary is reduced to less than
> - * logical, we want to prevent existing logical slots from
> - * being used. Existing logical slots on the standby get
> - * invalidated when this WAL record is replayed; and further,
> - * slot creation fails when wal_level is not sufficient; but
> - * all these operations are not synchronized, so a logical
> - * slot may creep in while the wal_level is being reduced.
> - * Hence this extra check.
> - */
> - if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> + memcpy(&logical_decoding, XLogRecGetData(buf->record), sizeof(bool));
>
> The patch has entirely removed this comment but I feel we should write
> something similar to it especially for the part: "Existing logical
> slots on the standby get invalidated when this WAL record is replayed;
> and further, slot creation fails when wal_level is not sufficient; but
> all these operations are not synchronized, so a logical slot may creep
> in while the wal_level is being reduced. Hence this extra check." Did
> anything change about this part of the comment?
>
> 4.
> WaitLSN "Waiting to read or update shared Wait-for-LSN state."
> +LogicalDecodingControl "Waiting to access logical decoding status information."
>
> Seeing the description just above, won't it be correct to say:"Waiting
> to read or update logical decoding status information."?

Fixed the above points.

>
> 5. The newly added test took approximately 8s on my machine, whereas
> other similar tests normally took 2-6s on the same machine, though
> there are some exceptions, such as 035_standby_logical_decoding.pl.
> See below results of some of the tests:
> -------
> [10:03:37] t/028_pitr_timelines.pl ............... ok 2254 ms (
> 0.00 usr 0.00 sys + 0.39 cusr 0.83 csys = 1.22 CPU)
> [10:03:39] t/029_stats_restart.pl ................ ok 2915 ms (
> 0.00 usr 0.00 sys + 0.34 cusr 0.42 csys = 0.76 CPU)
> [10:03:42] t/030_stats_cleanup_replica.pl ........ ok 2282 ms (
> 0.00 usr 0.00 sys + 0.42 cusr 0.66 csys = 1.08 CPU)
> [10:03:45] t/031_recovery_conflict.pl ............ ok 2705 ms (
> 0.00 usr 0.00 sys + 0.39 cusr 0.64 csys = 1.03 CPU)
> [10:03:47] t/032_relfilenode_reuse.pl ............ ok 2611 ms (
> 0.01 usr 0.00 sys + 0.37 cusr 0.61 csys = 0.99 CPU)
> [10:03:50] t/033_replay_tsp_drops.pl ............. ok 4860 ms (
> 0.00 usr 0.00 sys + 0.57 cusr 1.60 csys = 2.17 CPU)
> [10:03:55] t/034_create_database.pl .............. ok 922 ms (
> 0.00 usr 0.00 sys + 0.19 cusr 0.19 csys = 0.38 CPU)
> [10:03:56] t/035_standby_logical_decoding.pl ..... ok 10899 ms (
> 0.01 usr 0.00 sys + 1.13 cusr 2.21 csys = 3.35 CPU)
> [10:04:07] t/036_truncated_dropped.pl ............ ok 1781 ms (
> 0.00 usr 0.00 sys + 0.21 cusr 0.22 csys = 0.43 CPU)
> [10:04:09] t/037_invalid_database.pl ............. ok 944 ms (
> 0.00 usr 0.00 sys + 0.19 cusr 0.21 csys = 0.40 CPU)
> [10:04:09] t/038_save_logical_slots_shutdown.pl .. ok 1562 ms (
> 0.00 usr 0.00 sys + 0.21 cusr 0.36 csys = 0.57 CPU)
> [10:04:11] t/039_end_of_wal.pl ................... ok 4638 ms (
> 0.00 usr 0.00 sys + 0.48 cusr 0.66 csys = 1.14 CPU)
> [10:04:16] t/040_standby_failover_slots_sync.pl .. ok 7418 ms (
> 0.01 usr 0.00 sys + 0.81 cusr 1.82 csys = 2.64 CPU)
> [10:04:23] t/041_checkpoint_at_promote.pl ........ ok 1535 ms (
> 0.00 usr 0.00 sys + 0.29 cusr 0.51 csys = 0.80 CPU)
> [10:04:25] t/042_low_level_backup.pl ............. ok 2842 ms (
> 0.00 usr 0.00 sys + 0.37 cusr 0.66 csys = 1.03 CPU)
> [10:04:27] t/043_no_contrecord_switch.pl ......... ok 1946 ms (
> 0.00 usr 0.00 sys + 0.32 cusr 0.69 csys = 1.01 CPU)
> [10:04:29] t/044_invalidate_inactive_slots.pl .... ok 603 ms (
> 0.00 usr 0.00 sys + 0.19 cusr 0.17 csys = 0.36 CPU)
> [10:04:30] t/045_archive_restartpoint.pl ......... ok 4324 ms (
> 0.00 usr 0.00 sys + 0.97 cusr 0.66 csys = 1.63 CPU)
> [10:04:34] t/046_checkpoint_logical_slot.pl ...... ok 3322 ms (
> 0.00 usr 0.00 sys + 0.33 cusr 0.55 csys = 0.88 CPU)
> [10:04:38] t/047_checkpoint_physical_slot.pl ..... ok 1919 ms (
> 0.00 usr 0.00 sys + 0.28 cusr 0.43 csys = 0.71 CPU)
> [10:04:40] t/048_vacuum_horizon_floor.pl ......... ok 1413 ms (
> 0.01 usr 0.00 sys + 0.26 cusr 0.53 csys = 0.80 CPU)
> [10:04:41] t/049_wait_for_lsn.pl ................. ok 6851 ms (
> 0.00 usr 0.00 sys + 0.40 cusr 0.71 csys = 1.11 CPU)
> [10:04:48] t/050_effective_wal_level.pl .......... ok 8106 ms (
> 0.00 usr 0.00 sys + 0.83 cusr 1.79 csys = 2.62 CPU)
> ---------
>
> I haven't investigated to see if we can optimize or reduce the test
> timing without impacting the coverage or functionality, but just see
> if we can reduce it. If you think we can't do anything on this front
> without compromising functionality coverage, then I think we can live
> with it.

I guess that we cannot avoid making this test heavy to some extent,
given that it involves multiple replication setups, standby promotions,
injection points, etc. I've trimmed several test cases, and I hope that
helps reduce the test duration in your environment. It got a bit shorter
in my environment, but the test time is unstable.
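
Since the timings vary from run to run, a small helper for comparing them
may be handy. The following is only a rough sketch, not part of the patch:
it assumes the timer-enabled prove output format quoted above, and the names
parse_timings and slowest are made up for illustration.

```python
import re

# Matches the first line of a timing entry in the quoted prove output, e.g.
# "[10:03:37] t/028_pitr_timelines.pl ............... ok 2254 ms ("
TIMING_RE = re.compile(r"(t/\S+\.pl)\s+\.+\s+ok\s+(\d+)\s+ms")


def parse_timings(text):
    """Return {test_file: elapsed_ms} for every passing test found in text."""
    return {m.group(1): int(m.group(2)) for m in TIMING_RE.finditer(text)}


def slowest(timings, n=3):
    """Return the n slowest tests as (test_file, elapsed_ms), slowest first."""
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:n]


# A few lines taken from the results quoted above.
sample = """\
[10:03:56] t/035_standby_logical_decoding.pl ..... ok 10899 ms (
[10:04:16] t/040_standby_failover_slots_sync.pl .. ok 7418 ms (
[10:04:41] t/049_wait_for_lsn.pl ................. ok 6851 ms (
[10:04:48] t/050_effective_wal_level.pl .......... ok 8106 ms (
"""

for name, ms in slowest(parse_timings(sample)):
    print(f"{name}: {ms} ms")
```

Running it on the output of several runs would show whether
050_effective_wal_level.pl stays among the slowest tests or just fluctuates.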

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com