Re: Fix a race condition in ConditionVariableTimedSleep()

From: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Fix a race condition in ConditionVariableTimedSleep()
Date: 2025-05-05 11:15:15
Message-ID: 2143012f-ac50-4fd2-9697-63b41484713a@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

05.05.2025 10:52, Bertrand Drouvot пишет:
> Hi hackers,
>
> While working on wait event related stuff I observed a failed assertion:
>
> "
> TRAP: failed Assert("node->next == 0 && node->prev == 0"), File: "../../../../src/include/storage/proclist.h", Line: 91
> "
>
> during pg_regress/database.
>
> To reproduce, add an ereport(LOG,..) or CHECK_FOR_INTERRUPTS() or whatever
> would trigger ConditionVariableBroadcast() in pgstat_report_wait_end():

Interestingly, our colleague stepped into same problem recently [1] . It
happened because he attempted to make overcomplex timeout (SIGALARM) handler.

But his solution was a bit different [2].

[1] https://postgr.es/m/076eb7bd-52e6-4a51-ba00-c744d027b15c@postgrespro.ru
[2]
https://postgr.es/m/attachment/175030/0001-CV-correctly-handle-cv_sleep_target-change.patch

And I believe, his solution is more elegant. Doesn't it?

But in first step, I doubt there should be any thing that cancels condition
variable during WaitLatch. Most probably you did wrong thing.

We convinced the colleague to rework the code to not trigger the issue in
first place.

--
regards
Yura Sokolov aka funny-falcon

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrei Lepikhov 2025-05-05 11:38:39 Re: MergeAppend could consider sorting cheapest child path
Previous Message Amit Kapila 2025-05-05 10:59:18 Re: Fix slot synchronization with two_phase decoding enabled