Re: Logical decoding and walsender timeouts

From: Vladimir Gordiychuk <folyga(at)gmail(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Subject: Re: Logical decoding and walsender timeouts
Date: 2016-11-01 00:48:37
Message-ID: CAFgjRd2QJVYywihPyNF0f_Vtn3=AZDvziT=AMgmCAoL-6or1hw@mail.gmail.com
Lists: pgsql-hackers

>>> When sending a big message, WalSndWriteData() notices that it's
>>> approaching timeout and tries to send a keepalive request, but the
>>> request just gets buffered behind the remaining output plugin data and
>>> isn't seen by the client until the client has received the rest of the
>>> pending data.
>>
>> Only for individual messages, not the entire transaction though.

>Right. I initially thought it was the whole tx, but I was mistaken as
>I'd failed to notice that WalSndWriteData() queues a keepalive
>request.

This problem can be resolved by having the client send keepalives
periodically, at an interval shorter than the timeout configured on the
server. For example, if the server is configured with
wal_sender_timeout=60, the client should send a keepalive message roughly
every 60/3 seconds. In that case the server never needs to send a
keepalive with the reply-required flag set, and it will not disconnect
the client, because the walsender checks for incoming messages even while
decoding a huge transaction. I faced a similar problem in pgjdbc and
resolved it as described above.
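
A minimal sketch of that approach using libpq's replication protocol,
the same Standby Status Update ('r') message that pg_recvlogical sends.
The 'r' message layout is the documented wire format; the tracked
recv_lsn and the timer that decides when to call this are up to the
client:

#include <stdint.h>
#include <stdbool.h>
#include <libpq-fe.h>

/* Write a 64-bit value in network (big-endian) byte order. */
static void
put_u64_be(char *p, uint64_t v)
{
    int     i;

    for (i = 0; i < 8; i++)
        p[i] = (char) (v >> (56 - 8 * i));
}

/*
 * Send a Standby Status Update ('r') on an open COPY BOTH connection.
 * recv_lsn is the last WAL position the client has received; now_usec
 * is the client clock in microseconds since 2000-01-01 (the PostgreSQL
 * epoch).  Call this whenever (now - last_status) >= wal_sender_timeout/3.
 */
static bool
send_client_keepalive(PGconn *conn, uint64_t recv_lsn, int64_t now_usec)
{
    char    buf[1 + 8 + 8 + 8 + 8 + 1];
    int     off = 0;

    buf[off++] = 'r';                             /* Standby Status Update */
    put_u64_be(buf + off, recv_lsn); off += 8;    /* written up to */
    put_u64_be(buf + off, recv_lsn); off += 8;    /* flushed up to */
    put_u64_be(buf + off, 0);        off += 8;    /* applied: unknown */
    put_u64_be(buf + off, (uint64_t) now_usec); off += 8;
    buf[off++] = 0;                               /* replyRequested = false */

    return PQputCopyData(conn, buf, off) == 1 && PQflush(conn) == 0;
}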

2016-10-31 16:28 GMT+03:00 Craig Ringer <craig(at)2ndquadrant(dot)com>:

> On 31 October 2016 at 16:52, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Hi,
> >
> > On 2016-10-31 16:34:38 +0800, Craig Ringer wrote:
> >> TL;DR: Logical decoding clients need to generate their own keepalives
> >> and not rely on the server requesting them to prevent timeouts. Or
> >> admins should raise the wal_sender_timeout by a LOT when using logical
> >> decoding on DBs with any big rows.
> >
> > Unconvinced.
>
> Yeah. I've seen enough issues in the wild where we keep timing out and
> restarting over and over until we increase wal_sender_timeout to know
> there's _something_ going on. I am less sure I'm right about what it is
> or how to solve it.
>
> >> When sending a big message, WalSndWriteData() notices that it's
> >> approaching timeout and tries to send a keepalive request, but the
> >> request just gets buffered behind the remaining output plugin data and
> >> isn't seen by the client until the client has received the rest of the
> >> pending data.
> >
> > Only for individual messages, not the entire transaction though.
>
> Right. I initially thought it was the whole tx, but I was mistaken as
> I'd failed to notice that WalSndWriteData() queues a keepalive
> request.
>
> > Are
> > you sure the problem at hand is that we're sending a keepalive, but it's
> > too late?
>
> No, I'm not sure. I'm trying to identify the cause of an issue I've
> seen in the wild, but never under conditions where it's been possible
> to sit around and debug in a leisurely manner.
>
> I'm trying to set up a TAP test to demonstrate that this happens, but
> I don't think it's going to work without some kind of network
> bandwidth limitation simulation or simulated latency. A local unix
> socket is just too fast for Pg's row size limits.
>
> > It might very well be that the actual issue is that we're
> > never sending keepalives, because the network is fast enough / the tcp
> > window is large enough. IIRC we only send a keepalive if we're blocked
> > on network IO?
>
> Mm, that's a good point. That might better explain the issues I've
> seen in the wild, since I never found strong evidence that individual
> big rows were involved, but hadn't been able to come up with anything
> else yet.
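>
> For reference, the send path in WalSndWriteData() is roughly this shape
> (paraphrased from walsender.c, not verbatim source):
>
> pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);
>
> if (pq_flush_if_writable() != 0)
>     WalSndShutdown();
> if (!pq_is_send_pending())
>     return;                       /* fast path: all sent, no keepalive */
>
> for (;;)
> {
>     /* ... check interrupts, wait for the socket to become writable ... */
>
>     if (pq_flush_if_writable() != 0)
>         WalSndShutdown();
>     if (!pq_is_send_pending())
>         break;
>
>     now = GetCurrentTimestamp();
>     WalSndCheckTimeOut(now);          /* die if past the timeout */
>     WalSndKeepaliveIfNecessary(now);  /* keepalives only go out here */
> }
>
> So if the buffer drains fast enough we take the early return and never
> consider a keepalive at all, which fits your theory.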
>
> >> So: We could ask output plugins to deal with this for us, by chunking
> >> up their data in small pieces and calling OutputPluginPrepareWrite()
> >> and OutputPluginWrite() more than once per output plugin callback if
> >> they expect to send a big message. But this pushes the complexity of
> >> splitting up and handling big rows, and big Datums, onto each plugin.
> >> It's awkward to do well and hard to avoid splitting things up
> >> unnecessarily.
> >
> > There's decent reason for doing that independently though, namely that
> > it's a lot more efficient from a memory management POV.
>
> Definitely. Though you're always going to be tossing around ridiculous
> chunks of memory when dealing with big external compressed toasted
> data, unless there are ways to access that progressively that I'm
> unaware of. Hopefully there are.
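>
> To make the chunking idea concrete, a change callback that split its
> output across several messages would look something like this. Sketch
> only: the OutputPluginPrepareWrite()/OutputPluginWrite() calls are the
> real API, but the chunk tags and chunk size are made up:
>
> #include "postgres.h"
> #include "replication/logical.h"
> #include "replication/output_plugin.h"
>
> #define CHUNK_SIZE (64 * 1024)    /* assumption: 64kB per message */
>
> /*
>  * Emit one logical row as several protocol messages so the walsender
>  * can interleave keepalives between them.  Only the final message
>  * passes last_write = true.
>  */
> static void
> emit_in_chunks(LogicalDecodingContext *ctx, const char *data, Size len)
> {
>     Size    offset = 0;
>
>     while (offset < len)
>     {
>         Size    n = Min(CHUNK_SIZE, len - offset);
>         bool    last = (offset + n >= len);
>
>         OutputPluginPrepareWrite(ctx, last);
>         appendStringInfoChar(ctx->out, last ? 'D' : 'd');  /* made-up tags */
>         appendBinaryStringInfo(ctx->out, data + offset, n);
>         OutputPluginWrite(ctx, last);
>
>         offset += n;
>     }
> }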
>
> I'd quite like to extend the bdr/pglogical/logicalrep protocol so that
> in-core logical rep, in some later version, can write a field as 'to
> follow', like we currently mark unchanged toasted datums separately.
> Then send it chunked, after the main row, in follow-up messages. That
> way we keep processing keepalives, we don't allocate preposterous
> amounts of memory, etc.
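>
> Purely as a hypothetical wire sketch:
>
> /*
>  * Row message:  'R' relid ncols col...  where an oversized column is
>  *               sent as a one-byte "to follow" marker instead of its
>  *               value, much like unchanged toasted datums today.
>  * Chunk:        'C' relid colno chunkno len bytes...
>  * Final chunk:  'L' relid colno chunkno len bytes...
>  *
>  * The receiver reassembles the column after the main row, and the
>  * sender keeps servicing keepalives between chunk messages.
>  */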
>
> > I don't think the "unrequested keepalive" approach really solves the
> > problem on a fundamental enough level.
>
> Fair. It feels a bit like flailing in the dark, too.
>
> >> (A separate issue is that we can also time out when doing logical
> >> _replication_ if the downstream side blocks on a lock, since it's not
> >> safe to send on a socket from a signal handler ... )
> >
> > That's strictly speaking not true. write() / sendmsg() are signal safe
> > functions. There's good reasons not to do that however, namely that the
> > non signal handler code might be busy writing data itself.
>
> Huh, ok. And since in pglogical/bdr and as far as I can tell in core
> logical rep we don't send anything on the socket while we're calling
> in to heap access, the executor, etc, that'd actually be an option. We
> could possibly safeguard it with a volatile "socket busy" flag since
> we don't do much sending anyway. But I'd need to do my reading on
> signal handler safety etc. Still, good to know it's not completely
> absurd to do this if the issue comes up.
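>
> Something of this shape, perhaps. All names here are hypothetical and
> the pre-built keepalive buffer is elided:
>
> #include <signal.h>
> #include <stddef.h>
>
> /* Set while the normal code path is mid-write on the client socket. */
> static volatile sig_atomic_t socket_busy = 0;
>
> static void
> timeout_sigalrm_handler(int signo)
> {
>     (void) signo;
>
>     /*
>      * write()/send() are async-signal-safe per POSIX, so sending a
>      * pre-built keepalive here is OK provided the main code path is
>      * not itself mid-write on the same socket.
>      */
>     if (!socket_busy)
>         send_prebuilt_keepalive();   /* hypothetical helper */
> }
>
> static void
> guarded_write(const char *buf, size_t len)
> {
>     socket_busy = 1;
>     write_to_client(buf, len);       /* the normal send path, also hypothetical */
>     socket_busy = 0;
> }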
>
> Thanks very much for the input. I saw your post with proposed changes.
> Once I can get the issue simulated reliably I'll test the patch and
> see if it solves it, but it looks like it's sensible to just apply it
> anyway TBH.
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
