Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away
Date: 2020-09-17 16:50:00
Message-ID: bdf8b42d-fd2e-acf4-0644-2f73f04171c1@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020/09/17 15:44, Bharath Rupireddy wrote:
> Thanks for the review comments. I will post a new patch soon
> addressing all the comments.

Thanks a lot!

>
> On Tue, Sep 15, 2020 at 2:49 PM Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
>>
>> + PQreset(entry->conn);
>>
>> Isn't using PQreset() to reconnect to the foreign server unsafe?
>> When new connection is established, some SET commands seem
>> to need to be issued like configure_remote_session() does.
>> But PQreset() doesn't do that at all.
>>
>
> This is a good catch. Thanks, I missed this point. Indeed we need to
> set the session params. We can do this in two ways: 1. use
> configure_remote_session() after PQreset(). 2. use connect_pg_server()
> instead of PQreset() and configure_remote_session(). One problem I see
> with the 2nd way is that we will be doing the checks that are being
> performed in connect_pg_server() twice, which we would have done for
> the first time before retrying. The required parameters such as
> keywords, values, are still in entry->conn structure from the first
> attempt, which can safely be used by PQreset(). So, I will go with the
> 1st way. Thoughts?

In 1st way, you may also need to call ReleaseExternalFD() when new connection fails
to be made, as connect_pg_server() does. Also we need to check that
non-superuser has used password to make new connection,
as connect_pg_server() does? I'm concerned about the case where
pg_hba.conf is changed accidentally so that no password is necessary
at the remote server and the existing connection is terminated. In this case,
if we connect to the local server as non-superuser, connection to
the remote server should fail because the remote server doesn't
require password. But with your patch, we can successfully reconnect
to the remote server.

Therefore I like 2nd option. Also maybe disconnect_ps_server() needs to
be called before that. I'm not sure how much useful 1st option is.

>
>>
>> Originally when GetConnection() establishes new connection,
>> it resets all transient state fields, to be sure all are clean (as the
>> comment says). With the patch, even when reconnecting
>> the remote server, shouldn't we do the same, for safe?
>>
>
> I guess there is no need for us to reset all the transient state
> before we do begin_remote_xact() in the 2nd attempt. We retry the
> connection only when entry->xact_depth <= 0 i.e. beginning of the
> remote txn and the begin_remote_xact() doesn't modify any transient
> state if entry->xact_depth <= 0, except for entry->changing_xact_state
> = true; all other transient state is intact in entry structure. In the
> error case, we will not reach the code after do_sql_command in
> begin_remote_xact(). If needed, we can only set
> entry->changing_xact_state to false which is set to true before
> do_sql_command().

What if 2nd attempt happens with have_prep_stmt=true? I'm not sure
if this case really happens, though. But if that can, it's strange to start
new connection with have_prep_stmt=true even when the caller of
GetConnection() doesn't intend to create any prepared statements.

I think it's safer to do 2nd attempt in the same way as 1st one. Maybe
we can simplify the code by making them into common code block
or function.

>
> entry->changing_xact_state = true;
> do_sql_command(entry->conn, sql);
> entry->xact_depth = 1;
> entry->changing_xact_state = false;
>
>>
>> + PGresult *res = NULL;
>> + res = PQgetResult(entry->conn);
>> + PQclear(res);
>> Are these really necessary? I was just thinking that's not because
>> pgfdw_get_result() and pgfdw_report_error() seem to do that
>> already in do_sql_command().
>>
>
> If an error occurs in the first attempt, we return from
> pgfdw_get_result()'s if (!PQconsumeInput(conn)) to the catch block we
> added and pgfdw_report_error() will never get called. And the below
> part of the code is reached only in scenarios as mentioned in the
> comments. Removing this might have problems if we receive errors other
> than CONNECTION_BAD or for subtxns. We could clear if any result and
> just rethrow the error upstream. I think no problem having this code
> here.

But the orignal code works without this?
Or you mean that the original code has the bug?

>
> else
> {
> /*
> * We are here, due to either some error other than CONNECTION_BAD
> * occurred or connection may have broken during start of a
> * subtransacation. Just, clear off any result, try rethrowing the
> * error, so that it will be caught appropriately.
> */
> PGresult *res = NULL;
> res = PQgetResult(entry->conn);
> PQclear(res);
> PG_RE_THROW();
> }
>
>>
>> + /* Start a new transaction or subtransaction if needed. */
>> + begin_remote_xact(entry);
>>
>> Even when there is no cached connection and new connection is made,
>> then if begin_remote_xact() reports an error, another new connection
>> tries to be made again. This behavior looks a bit strange.
>>
>
> When there is no cached connection, we try to acquire one, if we
> can't, then no error will be thrown to the user, just we retry one
> more time. If we get in the 2nd attempt, fine, if not, we will throw
> the error to the user. Assume in the 1st attempt the remote server is
> unreachable, we may hope to connect in the 2nd attempt. I think there
> is no problem here.
>
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>
>

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2020-09-17 16:53:38 Re: "Unified logging system" breaks access to pg_dump debug outputs
Previous Message Tomas Vondra 2020-09-17 16:34:26 Re: WIP: BRIN multi-range indexes