Re: Race conditions in 019_replslot_limit.pl

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Race conditions in 019_replslot_limit.pl
Date: 2022-02-19 00:08:37
Message-ID: 20220219000837.q32ckxgpksulfwnf@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-02-18 18:49:14 -0500, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > We'd also need to add pq_endmessage_noblock(), because the pq_endmessage()
> > obviously tries to send (as in the backtrace upthread) if the output buffer is
> > large enough, which it often will be in walsender.
>
> I don't see that as "obvious". If we're there, we do not have an
> error situation.

The problem is that due walsender using pq_putmessage_noblock(), the output
buffer is often going to be too full for a plain pq_endmessage() to not send
out data. Because walsender will have an output buffer > PQ_SEND_BUFFER_SIZE
a lot of the time, errors will commonly be thrown with the output buffer
full.

That leads the pq_endmessage() in send_message_to_frontend() to block sending
the error message and the preceding WAL data. Leading to backtraces like:

> #0 0x00007faf4a570ec6 in epoll_wait (epfd=4, events=0x7faf4c201458, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> #1 0x00007faf4b8ced5c in WaitEventSetWaitBlock (set=0x7faf4c2013e0, cur_timeout=-1, occurred_events=0x7ffe47df2320, nevents=1)
> at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1465
> #2 0x00007faf4b8cebe3 in WaitEventSetWait (set=0x7faf4c2013e0, timeout=-1, occurred_events=0x7ffe47df2320, nevents=1, wait_event_info=100663297)
> at /home/andres/src/postgresql/src/backend/storage/ipc/latch.c:1411
> #3 0x00007faf4b70f48b in secure_write (port=0x7faf4c22da50, ptr=0x7faf4c2f1210, len=21470) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:298
> #4 0x00007faf4b71aecb in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1380
> #5 0x00007faf4b71ada1 in internal_putbytes (s=0x7ffe47df23dc "E\177", len=1) at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1326
> #6 0x00007faf4b71b0cf in socket_putmessage (msgtype=69 'E', s=0x7faf4c201700 "SFATAL", len=112)
> at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1507
> #7 0x00007faf4b71c151 in pq_endmessage (buf=0x7ffe47df2460) at /home/andres/src/postgresql/src/backend/libpq/pqformat.c:301
> #8 0x00007faf4babbb5e in send_message_to_frontend (edata=0x7faf4be2f1c0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3253
> #9 0x00007faf4bab8aa0 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1541
> #10 0x00007faf4bab5ec0 in errfinish (filename=0x7faf4bc9770d "postgres.c", lineno=3192, funcname=0x7faf4bc99170 <__func__.8> "ProcessInterrupts")
> at /home/andres/src/postgresql/src/backend/utils/error/elog.c:592
> #11 0x00007faf4b907e73 in ProcessInterrupts () at /home/andres/src/postgresql/src/backend/tcop/postgres.c:3192
> #12 0x00007faf4b8920af in WalSndLoop (send_data=0x7faf4b8927f2 <XLogSendPhysical>) at /home/andres/src/postgresql/src/backend/replication/walsender.c:2404

> > I guess we could try to flush in a blocking manner sometime later in the
> > shutdown sequence, after we've released resources? But I'm doubtful it's a
> > good idea, we don't really want to block waiting to exit when e.g. the network
> > connection is dead without the TCP stack knowing.
>
> I think you are trying to move in the direction of possibly exiting
> without ever sending at all, which does NOT seem like an improvement.

I was just talking about doing another blocking pq_flush(), in addition to the
pq_flush_if_writable() earlier. That'd be trying harder to send out data, not
less hard...

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-02-19 00:12:00 Re: Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Previous Message Andres Freund 2022-02-19 00:02:18 Re: Mark all GUC variable as PGDLLIMPORT