From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
Cc: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date: | 2025-09-04 18:15:21 |
Message-ID: | CAD21AoB4txbyZm=8+pm9-o4XR8dCuzXdTP=3Ymr5k8T6YPjWMg@mail.gmail.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
od On Tue, Sep 2, 2025 at 8:11 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Sawada-san,
>
> Here are my comments.
>
> 01.
> ```
> checkPoint.logicalDecodingEnabled = IsLogicalDecodingEnabled();
> ```
>
> Per my analysis, the value is always false here because StartupLogicalDecodingStatus
> is not called yet. Can we use "false" directly?
I think that it's better to read the shared flag instead of directly
setting false since LogicalDecodingCtl is already initialized.
>
> 02.
> ```
> elog(DEBUG1, "waiting for %d transactions to complete", running->xcnt);
> ```
>
> Here plural form is always used even if the running transaction is only one.
> How about something like:
> ```
> Number of transactions to wait finishing: %d
> ```
Hmm, not sure it improves the message. I think we don't care much
about plurals in debug messages. And in our convention the main log
message doesn't start a capital character.
>
> 03.
> ```
> while (RecoveryInProgress())
> {
> pgstat_report_wait_start(WAIT_EVENT_LOGICAL_DECODING_STATUS_CHANGE_DELAY);
> pg_usleep(100000L); /* wait for 100 msec */
> pgstat_report_wait_end();
> }
> ```
>
> I found a stuck case here: if a backend process within the loop and startup waits
> a signal is processed, both of them can stuck. The backend waits the recovery
> state to be DONE, and the startup waits all processes handle consume the signal.
> IIUC we must add CHECK_FOR_INTERRUPTS() or ProcessProcSignalBarrier().
>
> Actual steps:
>
> 0. constructed a streaming replication system, which the only primary server had
> a logical slot. I.e., the effective_wal_level was logical.
> 1. connected to a standby node
> 2. attached to the backend process via gdb
> 3. added a breakpoint at create_logical_replication_slot
> 4. called pg_create_logical_replication_slot() on the backend.
> the backend will stop before ReplicationSlotCreate().
> 5. from another terminal, attached to the startup process via gdb
> 6. added a breakpoint at UpdateLogicalDecodingStatusEndOfRecovery()
> 7. from another terminal, send a promote signal to the standby.
> The startup will stop at UpdateLogicalDecodingStatusEndOfRecovery()
> 8. executed steps on startup process, untill delay_status_change was updated
> and LogicalDecodingControlLock was released.
> 9. detached from the backend process. It would stop at the loop in
> start_logical_decoding_status_change().
> 10. detached from the startup process. It would wait all processes handled the
> signal, but the backend won't do.
Good find! I'll fix the problem by adding CHECK_FOR_INTERRUPTS() as
you suggested.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Sami Imseih | 2025-09-04 18:23:26 | Re: GetNamedLWLockTranche crashes on Windows in normal backend |
Previous Message | Emre Hasegeli | 2025-09-04 18:02:02 | Allow using replication origins in SQL level parallel sessions |