Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Date: 2021-01-29 05:46:54
Message-ID: CALj2ACXhM_yTBXXTCA-whZ_oA1x3DuGxyKnsROjMh8YZLG9bJw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 29, 2021 at 11:08 AM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Fri, Jan 29, 2021 at 10:55 AM Fujii Masao
> <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> > On 2021/01/29 14:12, Bharath Rupireddy wrote:
> > > On Fri, Jan 29, 2021 at 10:28 AM Fujii Masao
> > > <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote:
> > >> On 2021/01/29 11:09, Tom Lane wrote:
> > >>> Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> writes:
> > >>>> On Fri, Jan 29, 2021 at 1:52 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2021-01-26%2019%3A59%3A40
> > >>>>> This is a CLOBBER_CACHE_ALWAYS build, so I suspect what it's
> > >>>>> telling us is that the patch's behavior is unstable in the face
> > >>>>> of unexpected cache flushes.
> > >>>
> > >>>> Thanks a lot! It looks like the syscache invalidation messages are
> > >>>> generated too frequently with -DCLOBBER_CACHE_ALWAYS build due to
> > >>>> which pgfdw_inval_callback gets called many times in which the cached
> > >>>> entries are marked as invalid and closed if they are not used in the
> > >>>> txn. The new function postgres_fdw_get_connections outputs the
> > >>>> information of the cached connections such as name if the connection
> > >>>> is still open and their validity. Hence the output of the
> > >>>> postgres_fdw_get_connections became unstable in the buildfarm member.
> > >>>> I will further analyze making tests stable, meanwhile any suggestions
> > >>>> are welcome.
> > >>>
> > >>> I do not think you should regard this as "we need to hack the test
> > >>> to make it stable". I think you should regard this as "this is a
> > >>> bug". A cache flush should not cause user-visible state changes.
> > >>> In particular, the above analysis implies that you think a cache
> > >>> flush is equivalent to end-of-transaction, which it absolutely
> > >>> is not.
> > >>>
> > >>> Also, now that I've looked at pgfdw_inval_callback, it scares
> > >>> the heck out of me. Actually disconnecting a connection during
> > >>> a cache inval callback seems quite unsafe --- what if that happens
> > >>> while we're using the connection?
> > >>
> > >> If the connection is still used in the transaction, pgfdw_inval_callback()
> > >> marks it as invalidated and doesn't close it. So I was not thinking that
> > >> this is so unsafe.
> > >>
> > >> The disconnection code in pgfdw_inval_callback() was added in commit
> > >> e3ebcca843 to fix connection leak issue, and it's back-patched. If this
> > >> change is really unsafe, we need to revert it immediately at least from back
> > >> branches because the next minor release is scheduled soon.
> > >
> > > I think we can remove disconnect_pg_server in pgfdw_inval_callback and
> > > make entries only invalidated. Anyways, those connections can get
> > > closed at the end of main txn in pgfdw_xact_callback. Thoughts?
> >
> > But this revives the connection leak issue. So isn't it better to
> > to do that after we confirm that the current code is really unsafe?
>
> IMO, connections will not leak, because the invalidated connections
> eventually will get closed in pgfdw_xact_callback at the main txn end.
>
> IIRC, when we were finding a way to close the invalidated connections
> so that they don't leaked, we had two options:
>
> 1) let those connections (whether currently being used in the xact or
> not) get marked invalidated in pgfdw_inval_callback and closed in
> pgfdw_xact_callback at the main txn end as shown below
>
> if (PQstatus(entry->conn) != CONNECTION_OK ||
> PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
> entry->changing_xact_state ||
> entry->invalidated). ----> by adding this
> {
> elog(DEBUG3, "discarding connection %p", entry->conn);
> disconnect_pg_server(entry);
> }
>
> 2) close the unused connections right away in pgfdw_inval_callback
> instead of marking them invalidated. Mark used connections as
> invalidated in pgfdw_inval_callback and close them in
> pgfdw_xact_callback at the main txn end.
>
> We went with option (2) because we thought this would ease some burden
> on pgfdw_xact_callback closing a lot of invalid connections at once.

Also, see the original patch for the connection leak issue just does
option (1), see [1]. But in [2] and [3], we chose option (2).

I feel, we can go for option (1), with the patch attached in [1] i.e.
having have_invalid_connections whenever any connection gets invalided
so that we don't quickly exit in pgfdw_xact_callback and the
invalidated connections get closed properly. Thoughts?

static void
pgfdw_xact_callback(XactEvent event, void *arg)
{
HASH_SEQ_STATUS scan;
ConnCacheEntry *entry;

/* Quick exit if no connections were touched in this transaction. */
if (!xact_got_connection)
return;

[1] https://www.postgresql.org/message-id/CALj2ACVNcGH_6qLY-4_tXz8JLvA%2B4yeBThRfxMz7Oxbk1aHcpQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/f57dd9c3-0664-5f4c-41f0-0713047ae7b7%40oss.nttdata.com
[3] https://www.postgresql.org/message-id/CALj2ACVNjV1%2B72f3nVCngC7RsGSiGXZQ2mAzYx_Dij7oJpV8iA%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-01-29 05:48:11 Re: Why does creating logical replication subscriptions require superuser?
Previous Message David Rowley 2021-01-29 05:41:52 Re: Hybrid Hash/Nested Loop joins and caching results from subplans