|From:||Andres Freund <andres(at)anarazel(dot)de>|
|To:||Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>|
|Cc:||Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org|
|Subject:||Re: Race conditions in 019_replslot_limit.pl|
On 2022-02-18 14:42:48 -0800, Andres Freund wrote:
> On 2022-02-17 21:55:21 -0800, Andres Freund wrote:
> > Isn't it pretty bonkers that we allow error processing to get stuck behind
> > network traffic, *before* we have released resources (locks etc)?
> This is particularly likely to be a problem for walsenders, because they often
> have a large output buffer filled: walsender uses pq_putmessage_noblock() to
> send WAL data, which obviously can be large.
> In the stacktrace upthread you can see:
> #3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:29
> which certainly is more than in most other cases of error messages being
> sent. And it obviously might not be the first to have gone out.
> > I wonder if we should try to send, but do it in a nonblocking way.
> I think we should probably do so at least during FATAL error processing. But
> also consider doing so for ERROR, because not releasing resources after
> getting cancelled / terminated is pretty nasty imo.
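To make the "try to send, but do it in a nonblocking way" idea quoted above concrete, here is a minimal sketch on plain sockets. It is not PostgreSQL code: try_send_nonblocking() and demo_peer_not_reading() are hypothetical names, and it assumes a platform that supports MSG_DONTWAIT (Linux and the BSDs do). The point is only that the sender stops when the kernel buffer is full instead of waiting for a peer that is not reading, so it could go on to release its resources.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: queue as much of buf as the
 * kernel will accept right now. Returns the number of bytes queued (possibly
 * fewer than len, or 0, when the peer is not reading), or -1 on a real error.
 */
ssize_t
try_send_nonblocking(int fd, const char *buf, size_t len)
{
    size_t      sent = 0;

    while (sent < len)
    {
        /* MSG_DONTWAIT makes this one call nonblocking (platform assumption) */
        ssize_t     n = send(fd, buf + sent, len - sent, MSG_DONTWAIT);

        if (n > 0)
            sent += (size_t) n;
        else if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;              /* buffer full: give up instead of waiting */
        else
            return -1;          /* any other failure */
    }
    return (ssize_t) sent;
}

/*
 * Hypothetical demo: nobody ever reads from sv[1], so the socket buffer
 * fills up. Returns 1 if the sender came back with a short write rather
 * than blocking, 0 otherwise, -1 if the socket pair could not be created.
 */
int
demo_peer_not_reading(void)
{
    int         sv[2];
    static char chunk[21470];   /* same length as the write in the trace */
    ssize_t     n;
    int         rounds = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    do
    {
        n = try_send_nonblocking(sv[0], chunk, sizeof(chunk));
        rounds++;
    } while (n == (ssize_t) sizeof(chunk) && rounds < 100000);

    close(sv[0]);
    close(sv[1]);
    return (n >= 0 && n < (ssize_t) sizeof(chunk)) ? 1 : 0;
}
```

A blocking send() in the same situation would simply never return, which is the shape of the hang under discussion; whether dropping the unsent remainder is acceptable for ERROR as well as FATAL is the open question raised above.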
Is it possible that what we're seeing is a deadlock, with both walsender and
the pg_basebackup child trying to send data, but neither receiving?
But that'd require that somehow the basebackup child process didn't exit with
its parent. And I don't really see how that'd happen.
I'm running out of ideas for how to try to reproduce this. I think we might
need some additional debugging information to get more information from the
I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
to pg_basebackup in $node_primary3->backup(...).
It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),