Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Subject: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away
Date: 2020-07-08 12:39:54
Message-ID: CALj2ACUAi23vf1WiHNar_LksM9EDOWXcbHCo-fD4Mbr1d=78YQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Currently with the postgres_fdw remote connections cached in the local
backend, the queries that use the cached connections from local
backend will not check whether the remote backend is killed or gone
away, and it goes tries to submit the query and fails if the remote
backend is killed.

This problem was found during the discussions made in [1].

One way, we could solve the above problem is that, upon firing the new
foreign query from local backend using the cached connection,
(assuming the remote backend that was cached in the local backed got
killed for some reasons), instead of failing the query in the local
backend, upon detecting error from the remote backend, we could just
delete the cached old entry and try getting another connection to
remote backend, cache it and proceed to submit the query. This has to
happen only at the beginning of remote xact.

This way, instead of the query getting failed, the query succeeds if
the local backend is able to get a new remote backend connection.

Attaching the patch that implements the above retry mechanism.

The way I tested the patch:
1. select * from foreign_tbl; /*from local backend - this results in a
remote connection being cached in the postgres_fdw connection cache
and a remote backend is opened.*/
2. (intentionally) kill the remote backend, just to simulate the scenario.
3. select * from foreign_tbl; /*from local backend - without patch
this throws error "ERROR: server closed the connection unexpectedly".
with patch - try to use the cached connection at the beginning of
remote xact, upon receiving error from remote postgres server, instead
of aborting the query, delete the cached entry, try to get a new
connection, if it gets, caches it and uses that for executing the
query, query succeeds.

I couldn't think of adding a test case to the existing postgres_fdw
regression test suite with an automated scenario of the remote backend
getting killed.

I would like to thank Ashutosh Bapat (ashutosh(dot)bapat(dot)oss(at)gmail(dot)com)
for the suggestion to fix this and the review of my initial patch
attached in [2]. I tried to address the review comments provided on my
initial patch [3].

For, one of the Ashutosh's review comments from [3] "the fact that the
same connection may be used by multiple plan nodes", I tried to have
few use cases where there exist joins on two foreign tables on the
same remote server, in a single query, so essentially, the same
connection was used for multiple plan nodes. In this case we avoid
retrying for the second GetConnection() request for the second foreign
table, with the check entry->xact_depth <= 0 , xact_depth after the
first GetConnection() and the first remote query will become 1 and we
don't hit the retry logic and seems like we are safe here. Please add
If I'm missing something here.

Request the community to consider the patch for further review if the
overall idea seems beneficial.

[1] https://www.postgresql.org/message-id/CAExHW5t21B_XPQy_hownm1Qq%3DhMrgOhX%2B8gDj3YEKFvpk%3DVRgw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CALj2ACXp6DQ3iLGx5g%2BLgVtGwC4F6K9WzKQJpyR4FfdydQzC_g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAExHW5u3Gyv6Q1BEr6zMg0t%2B59e8c4KMfKVrV3Z%3D4UKKjJ19nQ%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
v1-Retry-Cached-Remote-Connections-For-postgres_fdw.patch application/octet-stream 7.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Paul Guo 2020-07-08 12:56:44 Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Previous Message Amit Kapila 2020-07-08 12:34:54 Re: Global snapshots