Re: Race conditions in 019_replslot_limit.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-02-18 23:40:10
Message-ID: 20220218234010.elria4uupyaxg3ld@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-02-18 18:15:21 -0500, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2022-02-17 21:55:21 -0800, Andres Freund wrote:
> >> Isn't it pretty bonkers that we allow error processing to get stuck behind
> >> network traffic, *before* we have have released resources (locks etc)?
>
> It's more or less intentional, per elog.c:
>
> /*
> * This flush is normally not necessary, since postgres.c will flush out
> * waiting data when control returns to the main loop. But it seems best
> * to leave it here, so that the client has some clue what happened if the
> * backend dies before getting back to the main loop ... error/notice
> * messages should not be a performance-critical path anyway, so an extra
> * flush won't hurt much ...
> */
> pq_flush();
>
> Perhaps it'd be sensible to do this only in debugging (ie Assert)
> builds?

That seems not great, because it pretty clearly can lead to hangs, which is
problematic in tests too. What about using pq_flush_if_writable()? In nearly
all situations that'd still push the failure to the client.

We'd also need to add pq_endmessage_noblock(), because the pq_endmessage()
obviously tries to send (as in the backtrace upthread) if the output buffer is
large enough, which it often will be in walsender.

I guess we could try to flush in a blocking manner sometime later in the
shutdown sequence, after we've released resources? But I'm doubtful it's a
good idea, we don't really want to block waiting to exit when e.g. the network
connection is dead without the TCP stack knowing.

Hm. There already is code trying to short-circuit sending errors to the client
if a backend gets terminated. Introduced in

commit 2ddb9149d14de9a2e7ac9ec6accf3ad442702b24
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: 2018-10-19 21:39:21 -0400

Server-side fix for delayed NOTIFY and SIGTERM processing.

and predecessors.

If ProcessClientWriteInterrupt() sees ProcDiePending, we'll stop trying to
send stuff to the client if writes block. However, this doesn't work here,
because we've already unset ProcDiePending:

#7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301
#8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253
#9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
#10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts")
at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
#11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192
#12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404
#13 0x00007faf4b88f82e in StartReplication (cmd=0x7faf4c293fc0) at /home/andres/src/postgresql/src/backend/replication/walsender.c:834

Before ProcessInterrupts() FATALs due to a SIGTERM, it resets ProcDiePending.

This seems not optimal.

We can't just leave ProcDiePending set, otherwise we'll probably end up
throwing more errors during the shutdown sequence. But it seems we need
something similar to proc_exit_inprogress, except set earlier? And then take
that into account in ProcessClientWriteInterrupt()?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-02-18 23:49:14 Re: Race conditions in 019_replslot_limit.pl
Previous Message Tom Lane 2022-02-18 23:27:11 Re: Time to drop plpython2?