| From: | Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> |
|---|---|
| To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Is it OK to perform logging while holding a LWLock? |
| Date: | 2026-03-04 00:44:17 |
| Message-ID: | 42B4228F-EA62-4095-9EBD-C61786E2B0E7@gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
> On Mar 4, 2026, at 07:37, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Mar 3, 2026 at 2:49 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>>
>>
>>
>>> On Mar 4, 2026, at 04:48, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>
>>> Hi,
>>>
>>> On Wed, Feb 11, 2026 at 5:07 PM Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> wrote:
>>>>
>>>>
>>>>
>>>>> On Feb 11, 2026, at 11:19, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>>>
>>>>> Chao Li <li(dot)evan(dot)chao(at)gmail(dot)com> writes:
>>>>>> As $SUBJECT says, my understanding is no.
>>>>>
>>>>> It's not a great idea, but I'm not sure it's fatal. There are places
>>>>> that hold LWLocks for awhile.
>>>>>
>>>>>> I think LWLocks are generally only held for a very short duration,
>>>>>> like a few CPU cycles to read or modify some shared data,
>>>>>
>>>>> Spinlocks are treated that way, but we're willing to hold LWLocks
>>>>> longer. The main thing I'd be concerned about is that there is no
>>>>> deadlock-detection infrastructure for LWLocks, so you'd better be
>>>>> darn certain there is no possibility of deadlock. That usually
>>>>> means you want to limit the extent of code that could run while
>>>>> you're holding the lock.
>>>>>
>>>>> In your specific example, the thing I'd be afraid of is that an
>>>>> errcontext callback might do something you're not expecting.
>>>>> We don't forbid errcontext callbacks from doing catalog lookups,
>>>>> for instance. So on the whole I agree with this patch, with
>>>>> or without any concrete example that fails.
>>>>>
>>>>> regards, tom lane
>>>>
>>>> As Tom has agreed with this patch, added to CF for tracking: https://commitfest.postgresql.org/patch/6477/
>>>>
>>>
>>> While I agree with the concern Tom pointed out, I can find that there
>>> are other places where we do ereport() while holding a lwlock. For
>>> instance:
>>>
>>> src/backend/access/transam/multixact.c:
>>> else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
>>> {
>>> ereport(LOG,
>>> (errmsg("cannot truncate up to MultiXact %u because it
>>> does not exist on disk, skipping truncation",
>>> newOldestMulti)));
>>> LWLockRelease(MultiXactTruncationLock);
>>> return;
>>> }
>>>
>>> src/backend/commands/vacuum.c:
>>> if (frozenAlreadyWrapped)
>>> {
>>> ereport(WARNING,
>>> (errmsg("some databases have not been vacuumed in over
>>> 2 billion transactions"),
>>> errdetail("You might have already suffered
>>> transaction-wraparound data loss.")));
>>> LWLockRelease(WrapLimitsVacuumLock);
>>> return;
>>> }
>>>
>>> src/backend/postmaster/autovacuum.c:
>>> LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
>>>
>>> /*
>>> * No other process can put a worker in starting mode, so if
>>> * startingWorker is still INVALID after exchanging our lock,
>>> * we assume it's the same one we saw above (so we don't
>>> * recheck the launch time).
>>> */
>>> if (AutoVacuumShmem->av_startingWorker != NULL)
>>> {
>>> worker = AutoVacuumShmem->av_startingWorker;
>>> worker->wi_dboid = InvalidOid;
>>> worker->wi_tableoid = InvalidOid;
>>> worker->wi_sharedrel = false;
>>> worker->wi_proc = NULL;
>>> worker->wi_launchtime = 0;
>>> dclist_push_head(&AutoVacuumShmem->av_freeWorkers,
>>> &worker->wi_links);
>>> AutoVacuumShmem->av_startingWorker = NULL;
>>> ereport(WARNING,
>>> errmsg("autovacuum worker took too long to
>>> start; canceled"));
>>> }
>>>
>>> Should we fix these places as well?
>>
>> Maybe. If needed, I can take a look at them one by one.
>>
>>>
>>> Also, if we reverse the ereport() and LWLockRelease() in the specific
>>> example in logicalctl.c, it would happen that a concurrent logical
>>> decoding activation writes the log "logical decoding is enabled upon
>>> creating a new logical replication slot" before the deactivation
>>> "logical decoding is disabled because there are no valid logical
>>> replication slots", confusing users since the logical decoding is
>>> active even though the last log saying "logical decoding is disabled".
>>>
>>
>> That sounds reasonable. However, looking at the current code, the “enabling” log is printed after releasing the lock:
>> ```
>> LWLockRelease(LogicalDecodingControlLock);
>>
>> END_CRIT_SECTION();
>>
>> if (!in_recovery)
>> ereport(LOG,
>> errmsg("logical decoding is enabled upon creating a new logical replication slot"));
>> ```
>>
>> So the log order is not currently protected by the lock. If we really want to ensure the ordering between these two messages, we might instead need to move the “enabling” log before releasing the lock.
>
> No, IIUC activation happens while a valid logical replication slot is
> held. Since deactivation requires that no slots are present, there's
> no risk of a concurrent deactivation occurring between the
> LWLockRelease() and the ereport().
>
>> I wonder what you think would be the best way to proceed here.
>
> Thinking more about this: while the fix might be applicable elsewhere,
> it seems unnecessary here. The deactivation process is restricted to
> the startup and checkpointer processes. Given that these processes
> don't access system catalogs for example, I think there is little or
> zero risk of a deadlock occurring even if we do something in
> errcontext.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
Thanks for the explanations. In that case, I will withdraw this patch.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Xuneng Zhou | 2026-03-04 00:49:44 | Re: amcheck: fix bug of missing corruption in allequalimage validation |
| Previous Message | Andrew Dunstan | 2026-03-04 00:37:48 | Re: pg_waldump: support decoding of WAL inside tarfile |