Re: walsender vs. XLogBackgroundFlush during shutdown

From: Alexander Kukushkin <cyberdemn(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: walsender vs. XLogBackgroundFlush during shutdown
Date: 2019-05-01 12:13:10
Message-ID: CAFh8B=m5jdPaH3v1buBV1EVbMOjTFoZGPaCPTc_Ws2eQTYimUg@mail.gmail.com
Lists: pgsql-hackers

Hi Tomas,

On Wed, 1 May 2019 at 02:28, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

> I see there's an ongoing discussion about race conditions in walreceiver
> blocking shutdown, so let me start a new thread about a race condition in
> walsender during shutdown.
>
> The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
> waiting for walsender processes to catch-up and terminate indefinitely.

I can confirm this; over the past couple of years we have observed such a
problem a few times, and it is really annoying.

> The reason for that is pretty simple - the walsenders are doing logical
> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
> But per XLogSendLogical() this only happens if
>
> if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
> {
> WalSndCaughtUp = true;
> ...
> }

After a couple of days of investigating and debugging I came to a
slightly different conclusion: WalSndCaughtUp is set to true in my
case.
Since I am mostly a Python person, I decided to use psycopg2 for my
investigation. I took the example from
http://initd.org/psycopg/docs/advanced.html#logical-replication-quick-start
as a starting point, created a slot and started the script.
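
For reference, that starting point looked roughly like the following
(a minimal sketch of the quick-start consumer, not my exact script; the
dbname, connection parameters and the test_decoding output plugin are
my assumptions here, while the slot name "test" matches the backtrace
further down):

    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        'dbname=postgres',
        connection_factory=psycopg2.extras.LogicalReplicationConnection)
    cur = conn.cursor()

    try:
        cur.start_replication(slot_name='test', decode=True)
    except psycopg2.ProgrammingError:
        cur.create_replication_slot('test', output_plugin='test_decoding')
        cur.start_replication(slot_name='test', decode=True)

    class DemoConsumer(object):
        def __call__(self, msg):
            # Only runs when a decoded change arrives; with no write
            # activity on the server this is never reached, so flush_lsn
            # is never reported back to the walsender.
            print(msg.payload)
            msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(DemoConsumer())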
I wasn't running any transactions, therefore DemoConsumer.__call__
was never executed and cursor.send_feedback(flush_lsn=msg.data_start)
was never called either. Basically, all the psycopg2 internals were
doing was periodically sending keepalive messages or replying to
keepalives sent by postgres. In the postgres debug log they are
visible as:

2019-05-01 12:58:32.785 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0

If you try to do a fast shutdown of postgres while the script is
running, it will never finish, and in the postgres log you will get an
indefinite stream of messages:
2019-05-01 13:00:02.880 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.880 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG: sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG: write 0/0 flush 0/0 apply 0/0

Actually, the same problem can happen even when the consumer script
does receive some messages, just not very frequently; it is only a bit
harder to reproduce.

If you attach to the walsender with gdb, you'll see the following picture:
(gdb) bt
#0  0x00007fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08, len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
#1  0x000055cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:318
#2  0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0, ptr=0x55cb958dca08, len=94) at be-secure.c:265
#3  0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
#4  0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
#5  0x000055cb93dac30b in send_message_to_frontend (edata=0x55cb942b4380 <errordata>) at elog.c:3317
#6  0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
#7  0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
#8  0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3 "sending replication keepalive") at elog.c:1376
#9  0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at walsender.c:3358
#10 0x000055cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29 <XLogSendLogical>) at walsender.c:2872
#11 0x000055cb93bc9155 in WalSndLoop (send_data=0x55cb93bc9e29 <XLogSendLogical>) at walsender.c:2194
#12 0x000055cb93bc7b11 in StartLogicalReplication (cmd=0x55cb95931cc0) at walsender.c:1109
#13 0x000055cb93bc83d6 in exec_replication_command (cmd_string=0x55cb958b2360 "START_REPLICATION SLOT \"test\" LOGICAL 0/00000000") at walsender.c:1541
#14 0x000055cb93c31653 in PostgresMain (argc=1, argv=0x55cb958deb68, dbname=0x55cb958deb48 "postgres", username=0x55cb958deb28 "akukushkin") at postgres.c:4178
#15 0x000055cb93b95185 in BackendRun (port=0x55cb958d71e0) at postmaster.c:4361
#16 0x000055cb93b94824 in BackendStartup (port=0x55cb958d71e0) at postmaster.c:4033
#17 0x000055cb93b90ccd in ServerLoop () at postmaster.c:1706
#18 0x000055cb93b90463 in PostmasterMain (argc=3, argv=0x55cb958ac710) at postmaster.c:1379
#19 0x000055cb93abb08e in main (argc=3, argv=0x55cb958ac710) at main.c:228
(gdb) f 10
#10 0x000055cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29 <XLogSendLogical>) at walsender.c:2872
2872            WalSndKeepalive(true);
(gdb) p WalSndCaughtUp
$1 = true
(gdb) p *MyWalSnd
$2 = {pid = 21845, state = WALSNDSTATE_STREAMING, sentPtr = 23586168, needreload = false, write = 0, flush = 23586112, apply = 0, writeLag = -1, flushLag = -1, applyLag = -1, mutex = 0 '\000', latch = 0x7fd66096b594, sync_standby_priority = 0}

As you can see, the value of WalSndCaughtUp is set to true! The
shutdown never finishes because the value of sentPtr is higher than
the values of MyWalSnd->flush and MyWalSnd->write:
(gdb) l 2858
2853            /*
2854             * To figure out whether all WAL has successfully been replicated, check
2855             * flush location if valid, write otherwise. Tools like pg_receivewal will
2856             * usually (unless in synchronous mode) return an invalid flush location.
2857             */
2858            replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
2859                    MyWalSnd->write : MyWalSnd->flush;
2860
2861            if (WalSndCaughtUp && sentPtr == replicatedPtr &&
2862                !pq_is_send_pending())
2863            {
2864                    /* Inform the standby that XLOG streaming is done */
2865                    EndCommand("COPY 0", DestRemote);
2866                    pq_flush();
2867
2868                    proc_exit(0);
2869            }

What is more interesting, if one is using pg_recvlogical, it is not
possible to reproduce the issue. That happens because pg_recvlogical
responds to keepalive messages by reporting a flush location equal to
the walEnd it received in the keepalive.

As the next logical step I tried to do the same in Python with
psycopg2. Unfortunately, the keepalive functionality is hidden in the
C code and it is not possible to change it without recompiling
psycopg2, but there is an asynchronous interface available:
http://initd.org/psycopg/docs/extras.html#psycopg2.extras.ReplicationCursor.wal_end.
I had to make just one minor adjustment:

    try:
        sel = select([cur], [], [], max(0, timeout))
        if not any(sel):
-           cur.send_feedback()  # timed out, send keepalive message
+           cur.send_feedback(flush_lsn=cur.wal_end)  # timed out, send keepalive message
    except InterruptedError:
        pass  # recalculate timeout and continue

wal_end is a property of the cursor object, and it is updated not
only for every message received but also from keepalive messages.
Basically, such a little change made the Python example behave very
similarly to pg_recvlogical.
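
For completeness, the whole asynchronous consumption loop with that
change applied looks roughly like this (a sketch based on the
asynchronous example in the psycopg2 docs; the keepalive_interval
value and the consume() body are my assumptions, and it assumes cur
was set up with start_replication() as in the quick-start above):

    from datetime import datetime
    from select import select

    keepalive_interval = 10.0   # seconds between status updates (assumption)

    def consume(msg):
        # process the decoded change and acknowledge it, as in the quick-start
        print(msg.payload)
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    while True:
        msg = cur.read_message()
        if msg:
            consume(msg)
        else:
            now = datetime.now()
            timeout = keepalive_interval - (now - cur.io_timestamp).total_seconds()
            try:
                sel = select([cur], [], [], max(0, timeout))
                if not any(sel):
                    # timed out: report wal_end (kept up to date by keepalives)
                    # as flushed, so the walsender sees flush catch up to sentPtr
                    cur.send_feedback(flush_lsn=cur.wal_end)
            except InterruptedError:
                pass  # recalculate timeout and continue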

All of the above probably looks like a brain dump, but I don't think
it conflicts with Tomas's findings; rather, it complements them.
I am very glad that I now know how to mitigate the problem on the
client side, but IMHO it is also very important to fix the server
behavior if at all possible.

Regards,
--
Alexander Kukushkin
