Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

From: Japin Li <japinli(at)hotmail(dot)com>
To: "Jelte Fennema-Nio" <postgres(at)jeltef(dot)nl>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Akshat Jaimini" <destrex271(at)gmail(dot)com>, <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
Date: 2026-01-10 01:34:33
Message-ID: SYAPR01MB30382BDFBE6D6B0D514F8F78B683A@SYAPR01MB3038.ausprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi, Jelte

On Fri, 09 Jan 2026 at 23:08, "Jelte Fennema-Nio" <postgres(at)jeltef(dot)nl> wrote:
> On Thu, 8 Jan 2026 at 18:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I took another look and I still don't like it. The intent of the code
>> is that once we're out of a query, a query cancel that arrived
>> meanwhile should be a no-op. I completely disagree with the idea that
>> a late-arriving cancel should cancel the next query. (From the user's
>> standpoint: if you're a bit slow to hit control-C and so your query
>> finishes anyway, and then you go for coffee, would you be surprised
>> to come back awhile later and enter a new query and have it fail?)
>> And I think the same should apply to statement timeouts: if one
>> happens slightly too late to cancel the current query, that does not
>> mean it should cancel the next one.
>
> The fact that you're thinking that my intent of the patch was this, that
> means I did a terrible job at explaining the bug that this patch is
> fixing. So I'll try doing a better job now.
>
> The bug this is fixing has nothing to do with multiple queries. It
> occurs within a single query being processed by exec_simple_query:
> 1. At line 1051 start_xact_command is called and then calls
> enable_statement_timeout to start the timeout for the query.
> 2. The code continues to line 1148 where start_xact_command is called AGAIN.
> 3. Now the timeout occurs (could also happen a bit earlier, as long as
> no CHECK_FOR_INTERRUPTS is called). This sets the statement timeout
> indicator and calls remove_timeout_index to mark the timeout as
> inactive.
> 4. Now the second start_xact_command calls enable_statement_timeout AGAIN.
> 5. Because get_timeout_active returns false now due to the timeout,
> enable_timeout_after is called again which resets the indicator flag
> 6. New CHECK_FOR_INTERRUPTS is called wherever in the query and
> QueryCancelPending is still set (correctly), but the matching
> indicator flag was lost.
>
> The problem is that while the comment above 4 shows that the intent of
> calling enable_statement_timeout again without an interceding
> finish_xact_command is a no-op:
>
> /*
> * Start statement timeout if necessary. Note that this'll intentionally
> * not reset the clock on an already started timeout, to avoid the timing
> * overhead when start_xact_command() is invoked repeatedly, without an
> * interceding finish_xact_command() (e.g. parse/bind/execute). If that's
> * not desired, the timeout has to be disabled explicitly.
> */
>
> But that's not actually the case if you run into this race where the
> timeout has already fired, because then It resets the just-before-set
> indicator flag. Note that the comment also explains that this is not the
> only place where this can happen.
>
>> /*
>> * (5) disable async signal conditions again.
>> *
>> * Query cancel is supposed to be a no-op when there is no query in
>> * progress, so if a query cancel arrived while we were idle, just
>> * reset QueryCancelPending. ProcessInterrupts() has that effect when
>> * it's called when DoingCommandRead is set, so check for interrupts
>> * before resetting DoingCommandRead.
>> */
>> CHECK_FOR_INTERRUPTS();
>> DoingCommandRead = false;
>>
>> We are taking care to discard a stale QueryCancelPending flag, but
>> this code is ignorant of the fact that timeouts now also feed into
>> QueryCancelPending. Perhaps we should clear the timeout indicators
>> here? Or maybe using CHECK_FOR_INTERRUPTS here is too cute and
>> we should just clear QueryCancelPending directly (and clear the
>> indicators too, likely).
>
> Thank you for pointing this out. I think (but I'm not exactly sure what
> I was thinking over a year ago), that I had trusted the comment there,
> and I had missed that my changes had inadvertently changed the behaviour
> here because now the indicators could still be set at this
> CHECK_FOR_INTERRUPTS call.
>
> Notably CHECK_FOR_INTERRUPTS is already clearing the indicators (your
> 2nd proposed option), because it's calling get_timeout_indicator with
> the true flag. Though that's not enough, because it still throws an
> error if they are set even if DoingCommandRead=true. So the newly
> attached patch changes that behaviour instead by checking if
> DoingCommandRead is false before throwing any error. Does that resolve
> your concerns about the possibility of cancels leaking to the next
> query? (which to be extremely clear, was unintended)
>
>> Also, I could buy making the proposed change in
>> disable_statement_timeout without the one in enable_statement_timeout.
>> I think that would have the desired effect that if ProcessInterrupts
>> runs shortly after disable_statement_timeout it would not lie about
>> the cause of the cancel, while not sacrificing the principle that
>> a statement timeout can't kill the next statement.
>
> As explained above the one in enable_statement_timeout is exactly the
> one that is required to fix the bug in question.
>
>> As a more thorough redesign, maybe we should get rid of the
>> idea that QueryCancelPending represents all three of these
>> conditions. That would mean a bit more duplication of logic
>> in ProcessInterrupts, but I don't think that code is especially
>> performance-critical.
>
> I would like that very much, as I think it would make the code much
> easier to follow (I think it's pretty hard to understand code
> currently). The problem I see with actually doing that though, is that
> StatementTimeoutHandler is sending SIGINT to the whole process group. So
> unless we want to stop doing that, I don't see a way of separating the
> indicators from QueryCancelPending.

I recently encountered this error [0]. I'm wondering if we could add a test
case to cover this situation.

[0] https://www.postgresql.org/message-id/MEAPR01MB30313EF4A7C22EA3495A861BB682A@MEAPR01MB3031.ausprd01.prod.outlook.com

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2026-01-10 01:42:36 Re: Adding basic NUMA awareness
Previous Message Fujii Masao 2026-01-10 00:44:36 Re: pg_recvlogical: Prevent flushed data from being re-sent after restarting replication