Re: [Proposal] Add foreign-server health checks infrastructure

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: masao(dot)fujii(at)oss(dot)nttdata(dot)com
Cc: kuroda(dot)hayato(at)fujitsu(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com, zyu(at)yugabyte(dot)com
Subject: Re: [Proposal] Add foreign-server health checks infrastructure
Date: 2022-02-16 02:50:58
Message-ID: 20220216.115058.1615761004338301624.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 1 Feb 2022 23:51:54 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
> This logic sounds complicated to me. I'm afraid that FDW developers
> may a bit easily misunderstand the logic and make the bug in their
> FDW.

> Isn't it simpler to just disable the timeout in core whenever the
> transaction ends whether committed or aborted, like statement_timeout
> is disabled after each command? For example, call something like
> DisableForeignCheckTimeout() in CommitTransaction() etc.

> This approach seems to assume that FDW must manage all the
> ForeignServer information so that the callback can return it. Is this
> assumption valid for all the FDW?

FWIW, I'm not sure this feature necessarily requires core support
dedicated to FDWs. The core have USER_TIMEOUT feature already and
FDWs are not necessarily connection based. It seems better if FDWs
can implement health check feature without core support and it seems
possible. Or at least the core feature should be more generic and
simpler. Why don't we just expose InTransactionHealthCheckCallbacks or
something and operating functions on it?

> How about making FDW trigger a query cancel interrupt by signaling
> SIGINT to the backend, instead?

Mmm. AFAICS the running command will stop with "canceling statement
due to user request", which is a hoax. We need a more decent message
there.

I understand that the motive of this patch is "to avoid wasted long
local work when fdw-connection dies". In regard to the workload in
your first mail, it is easily avoided by ending the transaction as soon
as remote access ends. This feature doesn't work for the case "begin;
<long local query>; <fdw access>". But the same measure also works in
that case. So the only case where this feature is useful is "begin;
<fdw-access>; <some long work>; <fdw-access>; end;". But in the first
place how frequently do you expecting remote-connection close happens?
If that happens so frequently, you might need to recheck the system
health before implementing this feature. Since it is correctly
detected when something really went wrong, I feel that it is a bit too
complex for the usefulness especially for the core part.

In conclusion, as my humble opinion I would like to propose to reduce
this feature to:

- Just periodically check health (in any aspect) of all live
connections regardless of the session state.

- If an existing connection is found to be dead, just try canceling
the query (or sending query cancel).

One issue with it is how to show the decent message for the query
cancel, but maybe we can have a global variable that suggests the
reason for the cancel.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2022-02-16 03:12:05 Re: [PATCH] Fix out-of-bouds access (src/common/wchar.c)
Previous Message Andres Freund 2022-02-16 02:42:35 Re: adding 'zstd' as a compression algorithm