RE: [Proposal] Add foreign-server health checks infrastructure

From: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Önder Kalacı' <onderkalaci(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com" <Shinya11(dot)Kato(at)oss(dot)nttdata(dot)com>, "zyu(at)yugabyte(dot)com" <zyu(at)yugabyte(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Subject: RE: [Proposal] Add foreign-server health checks infrastructure
Date: 2022-10-05 09:27:07
Message-ID: TYAPR01MB5866CE34430424A588F60FC2F55D9@TYAPR01MB5866.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Önder, Fujii-san,

Thanks for giving many comments! Based on off and on discussions, I modified my patch.
Note that this patch-set did not contain all of fixes you said.

[Modified]

1.
QueryCancelPending was cleaned up when the query cancel was skipped.
In the previous version the QueryCancelPending was set to true again even if the ereport(ERROR) was skipped.
It was because I thought the transaction that accessed to foreign servers but one of them crashed should be aborted immediately.
But currenlty postgres does not have such "delayed" query cancel mechanism, so I followed the manner.

Another approach is documenting like "If QueryCancelPending is already filled, the callback function must send SIGINT signal".
This seems simpler for us, but it may be troublesome for FDW authors. How do you think?

2.
The new filed servername was added to the hash entry. It was needed
because the server name must be logged in the pgfdw_connection_check(), but according to [1],
the catalog should not be read in ProcessInterrupts() to avoid the race condition.
The string is pstrdup()'d when the entry has been created, and replaced when ALTER SERER RENAME has been executed.

3.
The manner to implement the health check function was documented.

4.
raise() was replaced to kill(), because it was not used in core code.

5.
APIs to handle QueryCancelMessage were added to avoid overriding the message.

6.
Some checks for socket was added in the checking function.

[1]: https://www.postgresql.org/message-id/CA%2BTgmoYW_rSOW4JMQ9_0Df9PKQ%3DsQDOKUGA4Gc9D8w4wui8fSA%40mail.gmail.com

[Unmodified]

a.
The checking function is still in the postgres_fdw.
This is because I could not find any good file to move this function.
IIUC this function needs PGconn as an argument,
but it seems that it is declared in libpq-fe.h and it should not be included from core.

b.
In the checking function, the eventset is still created, added, and freed.
Firstly I tried to declare eventset as inner-function static variable and reuse that, but I could not do. This is because:

* The size of eventset cannot be determined when the function is called for the first time.
The size must be set in CreateWaitEventSet(), and it must be more than the number of remote connection.
* To reuse the eventset the added event must be removed every time, but it cannot.
The API ModifyWaitEvent() can be only used for "modifying" the event, not "removing".
If the function is called as events=0, we will get assertion failure in WaitEventAdjustEpoll() or something.

To avoid creating/freeing the event set, hash table must be modified but it has not done yet.

c.
Currently the status of socket is not checked in the GetConnection().
I agreed it may be useful, but I thought it might be out of scope. It could not meet the requirements.
I think it can be added separately.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment Content-Type Size
v16-0001-Add-an-infrastracture-for-checking-remote-server.patch application/octet-stream 8.8 KB
v16-0002-postgres_fdw-Implement-health-check-feature.patch application/octet-stream 12.6 KB
v16-0003-add-doc.patch application/octet-stream 4.7 KB
v16-0004-add-test.patch application/octet-stream 563.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2022-10-05 09:40:31 Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message Alvaro Herrera 2022-10-05 08:42:57 Re: [Commitfest 2022-09] Date is Over.