Re: [Proposal] Add foreign-server health checks infrastructure

From: Önder Kalacı <onderkalaci(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com" <Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com>, "zyu(at)yugabyte(dot)com" <zyu(at)yugabyte(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Subject: Re: [Proposal] Add foreign-server health checks infrastructure
Date: 2022-11-11 10:30:33
Message-ID: CACawEhW56PHQ83Q59x4U5zpi0rVJ6=0Vn-FeYSLZ13Y0yasebQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Hayota Kuroda,

> I (and my company) worried about overnight batch processing that
> contains some accesses to foreign servers. If the transaction is opened
> overnight and
> one of foreign servers is crashed during it, the transaction must be
> rollbacked.
> But there is a possibility that DBAs do not recognize the crash and
> they waste a time until the morning. This problem may affect customer's
> business.
> (It may not be sufficient to check the status from another different
> server.
> DBAs must check the network between the databases, and they may be
> oversight.)
> This is a motivation we thought.
>

Thanks for the clarification, I agree that this might be a valid concern
for systems.

>
> So how about implementing a check function as an SQL function once and
> update incrementally?
>

I think a SQL function makes sense.

> This still satisfy our motivation and it can avoid overhead because we can
> reduce the number of calling it.
>

Yes, it makes sense. Also, it allows other extensions to utilize the new
libpq function, which is neat.

> If we decide that we establish a new connection in the checking function,
> we can refactor the it.
> And if we decide that we introduce health-check BGworker, then we can add
> a process that calls implemented function periodically.
>

Right, your approach doesn't conflict with a more sophisticated approach,
in fact is a useful building block.

>
> Note that poll() is used here, it means that currently this function can
> be used on some limited platforms.
>
>
I think more experienced hackers could guide us here. I don't see a problem
with that, but checking other functions like pqSocketPoll(), I see that
Postgres uses the select system call. I wonder if we can have something
similar with select? Seems no, but wanted to bring up in case you know more
about that?

> I have added a parameter check_all that controls the scope of
> to-be-checked servers,
> But this is not related with my motivation so we can remove if not needed.
>

I guess it kind of makes sense to have the flexibility to check all
connections vs only in tx connections.

Though, maybe we should follow a similar approach
to postgres_fdw_disconnect(servername) / postgres_fdw_disconnect_all()

postgres_fdw_verify_connection_states(servername) /
postgres_fdw_verify_connection_states_all()

That seems like a more natural API considering the other UDFs. You can also
use in combination with postgres_fdw_get_connections()

> (I have not implemented another version that uses epoll() or kqueue(),
> because they seem to be not called on the libpq layer. Do you know any
> reasons?)
>
>
Hmm, I don't know the reason. Is that maybe epoll is available on a smaller
number of platforms and libpq can be used on more platforms as being a
client library?

Now, some comments regarding the v18:

>
> +static int
> +pqConncheck_internal(int sock)
> +{
> +#if (defined(HAVE_POLL) && defined(POLLRDHUP))
> + struct pollfd input_fd;
> +
> + input_fd.fd = sock;
> + input_fd.events = POLLRDHUP;
> + input_fd.revents = 0;
> +
> + poll(&input_fd, 1, 0);
> +
> + return input_fd.revents & POLLRDHUP;

WaitEventSetWaitBlock's* defined(WAIT_USE_POLL)* branch uses the following
check to find WL_SOCKET_CLOSED

#ifdef POLLRDHUP
if ((cur_event->events & WL_SOCKET_CLOSED) &&
(cur_pollfd->revents & (POLLRDHUP | errflags)))
{
/* remote peer closed, or error */
occurred_events->events |= WL_SOCKET_CLOSED;
}
#endif

Where *errflags = POLLHUP | POLLERR | POLLNVAL;*

So, should we also be using all these flags like: input_fd.events =
POLLRDHUP | *errflags*; ?

+ ereport(ERROR,
> + (errcode(ERRCODE_CONNECTION_FAILURE),
> + errmsg("could not connect to server \"%s\"",
> + server->servername),
> + errdetail("Socket close is detected."),
> + errhint("Please check the health of the server.")));

Is erroring out always necessary? Maybe we should just return true/false,
and let the caller decide what to do? For example, you might want to
rollback to savepoint and retry? If the caller wants to rollback the whole
transaction, that is possible anyway.

Maybe similar to postgres_fdw_disconnect(), we can warn if the connection
is involved in a tx (e.g., depth > 0)?

new file mode 100644
> index 0000000000..3ce169c837
> --- /dev/null
> +++ b/contrib/postgres_fdw/expected/postgres_fdw_1.out
> @@ -0,0 +1,11615 @@
> +-- ===================================================================
> +-- create FDW objects
> +-- ===================================================================
> +CREATE EXTENSION postgres_fdw;
> +CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
> +DO $d$
> + BEGIN

Do we really need this new file or just an oversight in your patch?

Thanks,
Onder KALACI

Software Engineer at Microsoft &
Developing the Citus database extension for PostgreSQL

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Konstantin Knizhnik 2022-11-11 10:43:13 Re: Lack of PageSetLSN in heap_xlog_visible
Previous Message John Naylor 2022-11-11 09:21:31 Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?