Re: Always show correct error message for statement timeouts, fixes random buildfarm failures

From: Jelte Fennema-Nio <me(at)jeltef(dot)nl>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alexander Lakhin <exclusion(at)gmail(dot)com>
Subject: Re: Always show correct error message for statement timeouts, fixes random buildfarm failures
Date: 2025-12-07 14:32:56
Message-ID: CAGECzQSP17h9nB1LDndKRk4MXEjC17CLyZbVKbLNz39CjL7Rcw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 18 Feb 2025 at 10:29, Jelte Fennema-Nio <me(at)jeltef(dot)nl> wrote:
> > So I think the solution in 0002, while debatable if strictly correct,
> > is the only fix that we can easily do. Also I personally believe the
> > behaviour resulting from 0002 is totally correct: The new behaviour
> > would be that if a timeout occurred, right before it was disabled or
> > reset, but the interrupt was not processed yet, then we process that
> > timeout as normal. That seems totally reasonable behaviour to me from
> > the perspective of an end user: You get a timeout error when the
> > timeout occurred before the timeout was disabled/reset.

I still think there's no reason for this not to be committed, as it
fixes both a bug and a flaky test.

On Tue, 18 Feb 2025 at 10:29, Jelte Fennema-Nio <me(at)jeltef(dot)nl> wrote:
>
> (was part of "Add non-blocking version of PQcancel" thread[1])
>
> The commitfest app notified me that this patch needed a rebase. While
> doing that I'm also moving this patch out of the original thread it
> was part of. It fixes a bug that causes this buildfarm failure[2]. In
> rare cases the current code would report "canceling statement due to
> user request", even though it was actually canceled due to a statement
> timeout, so it should have this error message: "canceling statement
> due to statement timeout". A way to reproduce this problem locally was
> provided by Alexander[3].
>
> I took it out of the original thread because that has become
> impossible to follow (at least for me, so I expect also for others)
> due to discussing 3 different random failures involving cancel
> requests. And this specific one doesn't even have anything to do with
> the original non-blocking PQCancel work.
>
> I have inlined the parts of emails that are relevant to this patch
> below. That way others don't have to dissect that thread. (for context
> this patch is often, but not always, called patch 0002 in the original
> thread).
>
> My initial email containing the patch[4]:
>
> On Mon, 24 Jun 2024 at 00:59, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> > On Sat, 22 Jun 2024 at 17:00, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> > > I also came across another failure of the test:
> > > @@ -2774,7 +2774,7 @@
> > > BEGIN;
> > > SET LOCAL statement_timeout = '10ms';
> > > select count(*) from ft1 CROSS JOIN ft2 CROSS JOIN ft4 CROSS JOIN ft5; -- this takes very long
> > > -ERROR: canceling statement due to statement timeout
> > > +ERROR: canceling statement due to user request
> > > COMMIT;
> > >
> > > which is reproduced with a sleep added here:
> > > @@ -1065,6 +1065,7 @@ exec_simple_query(const char *query_string)
> > > */
> > > parsetree_list = pg_parse_query(query_string);
> > > +pg_usleep(11000);
> >
> > After investigating, I realized this actually exposes a bug in our
> > statement timeout logic. It has nothing to do with posgres_fdw and
> > reproduces with any regular postgres query too. Attached is a patch
> > that fixes this issue. This one should probably be backported.
>
>
> Tom's review[5]:
>
> On Fri, 30 Aug 2024 at 22:12, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > I think 0002 is probably outright wrong, or at least the change to
> > disable_statement_timeout is. Once we get to that, we don't want
> > to throw a timeout error any more, even if an interrupt was received
> > just before it.
>
> My initial response to that review[6]:
>
> On Fri, 30 Aug 2024 at 23:24, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> > The disable_statement_timeout change was not the part of that patch
> > that was necessary for stable output, only the change in the first
> > branch of enable_statement_timeout was necessary. The reason being
> > that enable_statement_timeout is called multiple times for a query,
> > because start_xact_command is called multiple times in
> > exec_simple_query. The change to disable_statement_timeout just seemed
> > like the logical extension of that change, especially since there was
> > basically a verbatim copy of disable_statement_timeout in the second
> > branch of enable_statement_timeout.
> >
> > To make sure I understand your suggestion correctly: Are you saying
> > you would want to completely remove the outstanding interrupt if it
> > was caused by de statement_timout when disable_statement_timeout is
> > called? Because I agree that would probably make sense, but that
> > sounds like a more impactful change. But the current behaviour seems
> > strictly worse than the behaviour proposed in the patch to me, because
> > currently the backend would still be interrupted, but the error would
> > indicate a reason for the interrupt that is simply incorrect i.e. it
> > will say it was cancelled due to a user request, which never happened.
>
> My second response to that review[7]
>
> On Sat, 31 Aug 2024 at 09:08, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl> wrote:
> > I also took a quick look at the code again, and completely removing
> > the outstanding interrupt seems hard to do. Because there's no way to
> > know if there were multiple causes for the interupt, i.e. someone
> > could have pressed ctrl+c as well and we wouldn't want to undo that.
> >
> > So I think the solution in 0002, while debatable if strictly correct,
> > is the only fix that we can easily do. Also I personally believe the
> > behaviour resulting from 0002 is totally correct: The new behaviour
> > would be that if a timeout occurred, right before it was disabled or
> > reset, but the interrupt was not processed yet, then we process that
> > timeout as normal. That seems totally reasonable behaviour to me from
> > the perspective of an end user: You get a timeout error when the
> > timeout occurred before the timeout was disabled/reset.
>
>
> [1]: https://www.postgresql.org/message-id/flat/AM5PR83MB0178D3B31CA1B6EC4A8ECC42F7529(at)AM5PR83MB0178(dot)EURPRD83(dot)prod(dot)outlook(dot)com
> [2]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#timeouts.spec_failed_because_of_statement_cancelled_due_to_unexpected_reason
> [3]: https://www.postgresql.org/message-id/4db099c8-4a52-3cc4-e970-14539a319466%40gmail.com#a81203796f7257531d037e2ca6a0ac43
> [4]: https://www.postgresql.org/message-id/CAGECzQSBp%3DsCcMCy1txig7cV0VX9hDDb1UdZt3zpDpZe2X2K-Q%40mail.gmail.com#a8d453bddc52f0db0b0fa80676d352c5
> [5]: https://www.postgresql.org/message-id/584811.1725048717%40sss.pgh.pa.us#030de114b9a6a3eafd082a92fb95fa6e
> [6]: https://www.postgresql.org/message-id/CAGECzQR2HqQTFUddz-Q6y4sfMtTA%2BLRE7WqJtpPSrJKKUXtH7w%40mail.gmail.com#702f7a3fe3936b30bdaad0c5264f5a08
> [7]: https://www.postgresql.org/message-id/CAGECzQQM1N0H3e9RbSqi4%3DmmFgfLaFNr6R7gNpFoePb9exRryg%40mail.gmail.com#f11b82da794d5996eca82339c2a17caa

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dave Cramer 2025-12-07 14:37:34 Proposal to allow setting cursor options on Portals
Previous Message Sugamoto Shinya 2025-12-07 14:32:28 Re: [PATCH] Add error hints for invalid COPY options