RE: [Proposal] Add foreign-server health checks infrastructure

From: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com" <Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com>, "zyu(at)yugabyte(dot)com" <zyu(at)yugabyte(dot)com>, "masao(dot)fujii(at)oss(dot)nttdata(dot)com" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Subject: RE: [Proposal] Add foreign-server health checks infrastructure
Date: 2022-02-17 04:11:09
Message-ID: TYAPR01MB58661B088B6066282824ADD9F5369@TYAPR01MB5866.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear Horiguchi-san,

Thank you for your suggestions. I would like to confirm that I understand your points correctly.

> FWIW, I'm not sure this feature necessarily requires core support
> dedicated to FDWs. The core have USER_TIMEOUT feature already and
> FDWs are not necessarily connection based. It seems better if FDWs
> can implement health check feature without core support and it seems
> possible. Or at least the core feature should be more generic and
> simpler. Why don't we just expose InTransactionHealthCheckCallbacks or
> something and operating functions on it?

My understanding is that you find the core part too complicated and the FDW side too simple. Is that right?

> Mmm. AFAICS the running command will stop with "canceling statement
> due to user request", which is a hoax. We need a more decent message
> there.

+1 for a better message.

> I understand that the motive of this patch is "to avoid wasted long
> local work when fdw-connection dies".

Yes, your understanding is right.

> In regard to the workload in
> your first mail, it is easily avoided by ending the transaction as soon
> as remote access ends. This feature doesn't work for the case "begin;
> <long local query>; <fdw access>". But the same measure also works in
> that case. So the only case where this feature is useful is "begin;
> <fdw-access>; <some long work>; <fdw-access>; end;". But in the first
> place how frequently do you expecting remote-connection close happens?
> If that happens so frequently, you might need to recheck the system
> health before implementing this feature. Since it is correctly
> detected when something really went wrong, I feel that it is a bit too
> complex for the usefulness especially for the core part.

Thank you for analyzing the motivation.
Indeed, some cases can be resolved by splitting the transaction, and this event rarely happens.

> In conclusion, as my humble opinion I would like to propose to reduce
> this feature to:
>
> - Just periodically check health (in any aspect) of all live
> connections regardless of the session state.

I understand this to mean removing the following mechanisms from core:

* disabling the timeout at the end of a transaction
* skipping the check when held off or when reading commands

> - If an existing connection is found to be dead, just try canceling
> the query (or sending query cancel).
> One issue with it is how to show the decent message for the query
> cancel, but maybe we can have a global variable that suggests the
> reason for the cancel.

I do not have a good idea for that yet, but I will try to find one.
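
To check that we mean the same thing, here is a minimal standalone sketch of how I read the reduced design: a generic callback list, a periodic check that ignores the session state, and a global variable that records why the query is being canceled. This is not taken from the patch. Every name in it (register_health_check, run_health_checks, query_cancel_reason, and so on) is hypothetical, and the "lost connection" message text is only a placeholder.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: the reason the running query is being canceled. */
typedef enum CancelReason
{
    CANCEL_REASON_NONE,
    CANCEL_REASON_USER_REQUEST,
    CANCEL_REASON_REMOTE_SERVER_LOST
} CancelReason;

/* Hypothetical globals consulted when the cancel is reported. */
static CancelReason query_cancel_reason = CANCEL_REASON_NONE;
static bool query_cancel_pending = false;

/* A callback returns true while its connection still looks alive. */
typedef bool (*HealthCheckCallback) (void *arg);

#define MAX_HEALTH_CHECK_CALLBACKS 8

static struct
{
    HealthCheckCallback callback;
    void       *arg;
} health_checks[MAX_HEALTH_CHECK_CALLBACKS];
static int  n_health_checks = 0;

/* An FDW would register one callback per live connection. */
static bool
register_health_check(HealthCheckCallback callback, void *arg)
{
    if (n_health_checks >= MAX_HEALTH_CHECK_CALLBACKS)
        return false;
    health_checks[n_health_checks].callback = callback;
    health_checks[n_health_checks].arg = arg;
    n_health_checks++;
    return true;
}

/*
 * Meant to be called periodically, regardless of the session state.
 * On the first dead connection, record the reason and request a cancel.
 */
static void
run_health_checks(void)
{
    for (int i = 0; i < n_health_checks; i++)
    {
        if (!health_checks[i].callback(health_checks[i].arg))
        {
            query_cancel_reason = CANCEL_REASON_REMOTE_SERVER_LOST;
            query_cancel_pending = true;
            return;
        }
    }
}

/* Pick the message from the recorded reason (texts are placeholders). */
static const char *
cancel_message(void)
{
    switch (query_cancel_reason)
    {
        case CANCEL_REASON_REMOTE_SERVER_LOST:
            return "canceling statement due to lost connection to foreign server";
        case CANCEL_REASON_USER_REQUEST:
            return "canceling statement due to user request";
        default:
            return "";
    }
}

/* Demo callback for the sketch only: the "connection" is a bool flag. */
static bool
connection_flag_is_alive(void *arg)
{
    return *(bool *) arg;
}
```

In this sketch the core side only knows about callbacks and a reason code, so an FDW that is not connection based could register whatever check suits it, or none at all.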

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
