RE: [Proposal] Add foreign-server health checks infrastructure

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Önder Kalacı' <onderkalaci(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com" <Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com>, "zyu(at)yugabyte(dot)com" <zyu(at)yugabyte(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Subject: RE: [Proposal] Add foreign-server health checks infrastructure
Date: 2022-11-02 02:43:04
Message-ID: TYAPR01MB58668728393648C2F7DC7C85F5399@TYAPR01MB5866.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

Dear Önder, all,

Thank you for responding, and sorry for the late reply.

> A transaction failing with a bad error message (or holding some resources
> locally until the transaction is committed) doesn't sound essential to me.
> Is there any specific workload are you referring for optimizing to rollback
> a transaction earlier if a remote server dies? What kind of workload would
> benefit from that? Maybe there is, but not clear to me and haven't seen
> discussed on the thread (sorry if I missed).

I (and my company) are worried about overnight batch processing that
includes some accesses to foreign servers. If a transaction is open overnight and
one of the foreign servers crashes during it, the transaction must be rolled back.
But the DBAs may not notice the crash, and
time is then wasted until the morning. This problem may affect the customer's business.
(Checking the status from a different server may not be sufficient.
DBAs would also have to check the network between the databases, and that can be overlooked.)
This is our motivation.

> I'm trying to understand if we are trying to solve a problem that does not
> really exists. I'm bringing this up, because I often deal with
> architectures where there is a local node and remote transaction on
> different Postgres servers. And, I have not encountered many (or any)
> pattern that'd benefit from this change much. In fact, I think, on the
> contrary, this might add significant overhead for OLTP type of high query
> throughput systems.

As I said above, I did not consider OLTP systems. But I agree that the current
callback mechanism may have significant overhead.

Actually, we may not be able to decide on the correct way to detect the failure now.
Your suggestion, that the checks should be done by a bgworker and that we record stats about them,
seems efficient and may be smarter, but it may not match my motivation for now.
My approach may have large overhead and may not be usable for OLTP systems.

So how about implementing the check as an SQL function first and improving it incrementally?
This still satisfies our motivation, and it avoids the overhead because we can reduce the number of calls.
If we decide to establish a new connection in the checking function, we can refactor it.
And if we decide to introduce a health-check bgworker, we can add a process that calls the implemented function periodically.

PSA a patchset that implements this as an SQL function. I moved the checking function to the libpq layer, fe-misc.c.
Note that poll() is used here, which means that the function can currently be used only on a limited set of platforms.
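
To illustrate the kind of check meant here, the following is a minimal sketch and not the code from the attached patch. It assumes a Linux-like platform. The names peer_has_closed() and demo_detects_close() are hypothetical, and POLLRDHUP is used only where the platform defines it, which is one reason such a check is limited to some platforms.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes POLLRDHUP on glibc */
#endif
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (assumption, not the patch's function).
 * Returns 1 if the peer appears to have closed the connection,
 * 0 if no hangup is visible, and -1 if poll() itself failed.
 */
static int
peer_has_closed(int sock)
{
    struct pollfd pfd;
    short       closed_mask = POLLHUP | POLLERR | POLLNVAL;
    int         rc;

    pfd.fd = sock;
    pfd.events = 0;             /* POLLHUP/POLLERR are reported regardless */
    pfd.revents = 0;
#ifdef POLLRDHUP
    pfd.events |= POLLRDHUP;    /* peer closed its writing half */
    closed_mask |= POLLRDHUP;
#endif

    /* Zero timeout: only look at what the kernel already knows. */
    do
    {
        rc = poll(&pfd, 1, 0);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (pfd.revents & closed_mask) ? 1 : 0;
}

/*
 * Demo: a local socket pair stands in for the connection to a
 * foreign server. Returns 1 if the close is detected as expected.
 */
static int
demo_detects_close(void)
{
    int         sv[2];
    int         before;
    int         after;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    before = peer_has_closed(sv[0]);
    close(sv[1]);               /* simulate the remote side going away */
    after = peer_has_closed(sv[0]);
    close(sv[0]);
    return (before == 0 && after == 1) ? 1 : 0;
}
```

Such a check only sees a hangup that the local kernel has already been told about. A peer that disappears without closing the socket (for example, after a network partition) would not be detected this way.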

I have added a parameter check_all that controls the scope of the servers to be checked,
but this is not related to my motivation, so we can remove it if it is not needed.

(I have not implemented another version that uses epoll() or kqueue(),
because they do not seem to be called in the libpq layer. Do you know of any reason for that?)

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment Content-Type Size
v18-0001-Add-PQConncheck-to-libpq.patch application/octet-stream 4.0 KB
v18-0002-postgres_fdw-add-postgres_fdw_verify_foreign_ser.patch application/octet-stream 4.9 KB
v18-0003-add-test.patch application/octet-stream 1.1 MB
