Slow client can delay replication despite max_standby_streaming_delay set

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Slow client can delay replication despite max_standby_streaming_delay set
Date: 2021-11-17 05:39:58
Message-ID: 46726FD0-147E-40CF-8315-87708A467D99@yandex-team.ru
Lists: pgsql-hackers

Hi hackers!

$subj was recently observed on one of our installations.

Startup process backtrace:
#0 0x00007fd216660d27 in epoll_wait (epfd=525, events=0x55c688dfbde8, maxevents=maxevents@entry=1, timeout=timeout@entry=-1)
#1 0x000055c687264be9 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffdf8089f00, cur_timeout=-1, set=0x55c688dfbd78)
#2 WaitEventSetWait (set=set@entry=0x55c688dfbd78, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdf8089f00, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=67108864) at /build/../src/backend/storage/ipc/latch.c:1000
#3 0x000055c687265038 in WaitLatchOrSocket (latch=0x7fd1fa735454, wakeEvents=wakeEvents@entry=1, sock=sock@entry=-1, timeout=-1, timeout@entry=0, wait_event_info=wait_event_info@entry=67108864)
#4 0x000055c6872650f5 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=1, timeout=timeout@entry=0, wait_event_info=wait_event_info@entry=67108864)
#5 0x000055c687276399 in ProcWaitForSignal (wait_event_info=wait_event_info@entry=67108864)
#6 0x000055c68726c898 in ResolveRecoveryConflictWithBufferPin ()
#7 0x000055c6872582c5 in LockBufferForCleanup (buffer=292159)
#8 0x000055c687259447 in ReadBuffer_common (smgr=0x55c688deae40, relpersistence=relpersistence@entry=112 'p', forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=3751242, mode=mode@entry=RBM_ZERO_AND_CLEANUP_LOCK, strategy=strategy@entry=0x0, hit=0x7ffdf808a117 "\001")
#9 0x000055c687259b6b in ReadBufferWithoutRelcache (rnode=..., forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=3751242, mode=mode@entry=RBM_ZERO_AND_CLEANUP_LOCK, strategy=strategy@entry=0x0)
#10 0x000055c68705655f in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=3751242, mode=RBM_ZERO_AND_CLEANUP_LOCK)
#11 0x000055c687056706 in XLogReadBufferForRedoExtended (record=record@entry=0x55c688dd2378, block_id=block_id@entry=0 '\000', mode=mode@entry=RBM_NORMAL, get_cleanup_lock=get_cleanup_lock@entry=1 '\001', buf=buf@entry=0x7ffdf808a218)
#12 0x000055c68700728b in heap_xlog_clean (record=0x55c688dd2378)
#13 heap2_redo (record=0x55c688dd2378)
#14 0x000055c68704a7eb in StartupXLOG ()

Backend holding the buffer pin:

#0 0x00007fd216660d27 in epoll_wait (epfd=5, events=0x55c688d67ca8, maxevents=maxevents@entry=1, timeout=timeout@entry=-1)
#1 0x000055c687264be9 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffdf808e070, cur_timeout=-1, set=0x55c688d67c38)
#2 WaitEventSetWait (set=0x55c688d67c38, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdf808e070, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663297)
#3 0x000055c687185d7e in secure_write (port=0x55c688db4a60, ptr=ptr@entry=0x55c688dd338e, len=len@entry=2666)
#4 0x000055c687190f4c in internal_flush ()
#5 0x000055c68719118a in internal_putbytes (s=0x55c68f8dcc35 "", s@entry=0x55c68f8dcc08 "", len=65)
#6 0x000055c687191262 in socket_putmessage (msgtype=<optimized out>, s=0x55c68f8dcc08 "", len=<optimized out>)
#7 0x000055c687193431 in pq_endmessage (buf=buf@entry=0x7ffdf808e1a0)
#8 0x000055c686fd1442 in printtup (slot=0x55c6894c2dc0, self=0x55c689326b40)
#9 0x000055c687151962 in ExecutePlan (execute_once=<optimized out>, dest=0x55c689326b40, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55c689281a28, estate=0x55c6892816f8)
#10 standard_ExecutorRun (queryDesc=0x55c689289628, direction=<optimized out>, count=0, execute_once=<optimized out>)
#11 0x00007fd2074100c5 in pgss_ExecutorRun (queryDesc=0x55c689289628, direction=ForwardScanDirection, count=0, execute_once=<optimized out>)
#12 0x000055c68728a356 in PortalRunSelect (portal=portal@entry=0x55c688d858b8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x55c689326b40)
#13 0x000055c68728b988 in PortalRun (portal=portal@entry=0x55c688d858b8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x55c689326b40, altdest=altdest@entry=0x55c689326b40, completionTag=0x7ffdf808e580 "")
#14 0x000055c687287425 in exec_simple_query (query_string=0x55c688ec5e38 "select\n '/capacity/created-at-counter-by-time-v2' as sensor,\n round(extract(epoch from shipment_date))
#15 0x000055c687289418 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x55c688dd3e28, dbname=<optimized out>, username=<optimized out>)

I think the problem here is that secure_write() waits on the socket with an infinite timeout.
Presumably we rely on TCP keepalives here, but they did not fire for some reason; it looks like the client was alive, just very slow to read its data. As a result the backend never releases its buffer pin, and the startup process stays blocked in LockBufferForCleanup() well past max_standby_streaming_delay.
Does it make sense to look for infinite timeouts in the client-communication code and replace them with a loop that checks for interrupts?
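
To make the idea concrete, below is a rough sketch (not a patch) of the kind of bounded wait loop I have in mind for the retry path around the client socket. The helper name is invented, the 1-second timeout is an arbitrary illustrative value, and the real secure_write() path uses FeBeWaitSet rather than WaitLatchOrSocket(); exact header locations (e.g. for WAIT_EVENT_CLIENT_WRITE) also differ between branches.

#include "postgres.h"

#include "libpq/libpq-be.h"     /* struct Port */
#include "miscadmin.h"          /* MyLatch */
#include "pgstat.h"             /* WAIT_EVENT_CLIENT_WRITE (location varies by branch) */
#include "storage/latch.h"      /* WaitLatchOrSocket(), WL_* flags */
#include "tcop/tcopprot.h"      /* ProcessClientWriteInterrupt() */

/*
 * Illustrative sketch: wait for the client socket to become writable again,
 * but wake up at least once a second so that pending interrupts get a chance
 * to be serviced instead of sleeping indefinitely in epoll_wait().
 */
static void
wait_for_client_writable(Port *port)
{
    for (;;)
    {
        int         rc;

        rc = WaitLatchOrSocket(MyLatch,
                               WL_LATCH_SET | WL_SOCKET_WRITEABLE |
                               WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                               port->sock,
                               1000L,   /* arbitrary 1 s wakeup */
                               WAIT_EVENT_CLIENT_WRITE);

        if (rc & WL_LATCH_SET)
            ResetLatch(MyLatch);

        /* Let a pending die interrupt terminate us while blocked on I/O. */
        ProcessClientWriteInterrupt(true);

        if (rc & WL_SOCKET_WRITEABLE)
            return;             /* caller retries the send */
    }
}

IIUC ProcessClientWriteInterrupt() essentially only reacts to a pending die interrupt while blocked, so by itself this would not cancel the query on a plain recovery conflict, but at least the backend would wake up regularly instead of sleeping in epoll_wait() forever.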

Best regards, Andrey Borodin.
