Re: statement_timeout is not working as expected with postgres_fdw

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: ashutosh(dot)bapat(at)enterprisedb(dot)com
Cc: suraj(dot)kharage(at)enterprisedb(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: statement_timeout is not working as expected with postgres_fdw
Date: 2017-04-26 08:30:38
Message-ID: 20170426.173038.12959620.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 25 Apr 2017 15:43:59 +0530, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote in <CAFjFpRe6uf3-nU-orTEAJ0_CKb3MMiPQ5AHJ_SwDguV7Sjs6Ww(at)mail(dot)gmail(dot)com>
> On Tue, Apr 25, 2017 at 1:31 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >>
> >> The logs above show that 34 seconds elapsed between starting to abort
> >> the transaction and knowing that the foreign server is unreachable. It
> >> looks like it took that much time for the local server to realise that
> >> the foreign server is not reachable. Looking at PQcancel code, it
> >> seems to be trying to connect to the foreign server to cancel the
> >> query. But somehow it doesn't seem to honor connect_timeout setting.
> >> Is that expected?
> >
> > Yes, and No. I think PQcancel requires connection timeout, but I
> > think it is not necessariry the same with that of a foreign
> > server.
>
> Since connect_timeout is property of foreign server, it should be
> honored by any connection made to that server from local server
> including the one by PQcancel().

The two connections have different characteristics. A query
(client) connection should be finally established. On the other
we may give up to establish a cancel connection in certain cases.
For example, when a network stall is anticipated, we can rather
choose to trash the session on it than waiting for the timeout.

> >> Irrespective of what PQcancel does, it looks like postgres_fdw should
> >> just slam the connection if query is being aborted because of
> >> statement_timeout. But then pgfdw_xact_callback() doesn't seem to have
> >> a way to know whether this ABORT is because of user's request to
> >> cancel the query, statement timeout, an abort because of some other
> >> error or a user requested abort. Except statement timeout (may be
> >> user's request to cancel the query?), it should try to keep the
> >> connection around to avoid any future reconnection. But I am not able
> >> to see how can we provide that information to pgfdw_xact_callback().
> >
> > Expiration of statement_timeout doesn't mean a stall of foreign
> > connections. If we are to keep connections by, for example, a
> > cancel request from client, we also should keep them on
> > statememt_timeout because it is not necessariry triggered by a
> > stall of foreign connection.
>
> When statement_timeout completes, we don't want to spend more time in
> trying to cancel queries: esp when there are many foreign server, each
> consuming some "timeout" time OR even trying to send Abort transaction
> statement. Instead, we should slam those down. I consider this to be
> different from query cancellation since query cancellation doesn't
> have a hard bound on time, although we would like to cancel the
> running query as fast as possible. Rethinking about it, probably we
> should slam down the connection in case of query cancel as well.

Yeah, both are rather unusual, in other words, we don't expect it
to happen twice or more in a short interval. With that
prerequisite, I think slamming all down is optimal. But
defenitely not for the commanded rollback case.

> > I think we can detect a stall of the channel where the foreign
> > connections are on by a cancel request with a very short timeout,
> > although it is a bit incorrect.
> >
> > I reconsider this problem and my proposal for this issue is as
> > the follows.
> >
> > - Foreign servers have a new options 'stall_detection_threshold'
> > in milliseconds, maybe defaults to connect_timeout of the
> > foreign server setting. For many foreign servers in a local
> > network, it could be lowered to several tens of milliseconds.
>
> A connect_timeout less than 2 seconds is not encouraged.
> https://www.postgresql.org/docs/devel/static/libpq-connect.html. So,
> we can not set stall_detection_threshold to be smaller than 2 seconds.

It seems coming from DNS lookup timeout so users can set more
short timeout. Anyway the several-tens or hundres millisecond is
an extreme example. But 2 seconds for every connection would be
too long. Anyway the resolution is 1 second..

> statement_timeout however is set in milliseconds, so 2 seconds per
> connection would be quite a lot compared to statement_timeout setting.
> Waiting to cancel query for 2 seconds when the statement_timeout
> itself is 2 seconds would mean the query would be cancelled after 4
> seconds, which is kind of funny.

Yes, we can keep connections in the cases but in turn it can
complel a user to wait for incrediblly long time for a cancel
request. We could reduce the time to wait with some complex stuff
but I don't think we can cancel queries to hundreds of foreign
servers in a second without asynchronisity or parallelism. It
seems to me unreasonablly complex.

In short, I'd like to just discard all connections by the two
kind of triggers.

I suppose that we can inform the cause to the transaction
callback as an event, but I haven't realize how
AbortCurrentTransaction can detect the cause. Letting
ProcessInterrupts set an additional global variable such like
requested_immediate_abort could work.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2017-04-26 08:31:28 Re: an outdated comment for hash_seq_init.
Previous Message Konstantin Knizhnik 2017-04-26 07:49:18 Re: Cached plans and statement generalization