Re: Slow client can delay replication despite max_standby_streaming_delay set

From: Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Slow client can delay replication despite max_standby_streaming_delay set
Date: 2021-11-18 04:47:50
Message-ID: ce425978-4c10-c534-68e5-c06d8c8e8647@oss.nttdata.com
Lists: pgsql-hackers

On 2021/11/17 14:39, Andrey Borodin wrote:
> Hi hackers!
>
> $subj was recently observed on one of our installations.
>
> Startup process backtrace
> #0 0x00007fd216660d27 in epoll_wait (epfd=525, events=0x55c688dfbde8, maxevents=maxevents@entry=1, timeout=timeout@entry=-1)
> #1 0x000055c687264be9 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffdf8089f00, cur_timeout=-1, set=0x55c688dfbd78)
> #2 WaitEventSetWait (set=set@entry=0x55c688dfbd78, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdf8089f00, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=67108864) at /build/../src/backend/storage/ipc/latch.c:1000
> #3 0x000055c687265038 in WaitLatchOrSocket (latch=0x7fd1fa735454, wakeEvents=wakeEvents@entry=1, sock=sock@entry=-1, timeout=-1, timeout@entry=0, wait_event_info=wait_event_info@entry=67108864)
> #4 0x000055c6872650f5 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=1, timeout=timeout@entry=0, wait_event_info=wait_event_info@entry=67108864)
> #5 0x000055c687276399 in ProcWaitForSignal (wait_event_info=wait_event_info@entry=67108864)
> #6 0x000055c68726c898 in ResolveRecoveryConflictWithBufferPin ()
> #7 0x000055c6872582c5 in LockBufferForCleanup (buffer=292159)
> #8 0x000055c687259447 in ReadBuffer_common (smgr=0x55c688deae40, relpersistence=relpersistence@entry=112 'p', forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=3751242, mode=mode@entry=RBM_ZERO_AND_CLEANUP_LOCK, strategy=strategy@entry=0x0, hit=0x7ffdf808a117 "\001")
> #9 0x000055c687259b6b in ReadBufferWithoutRelcache (rnode=..., forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=3751242, mode=mode@entry=RBM_ZERO_AND_CLEANUP_LOCK, strategy=strategy@entry=0x0)
> #10 0x000055c68705655f in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=3751242, mode=RBM_ZERO_AND_CLEANUP_LOCK)
> #11 0x000055c687056706 in XLogReadBufferForRedoExtended (record=record@entry=0x55c688dd2378, block_id=block_id@entry=0 '\000', mode=mode@entry=RBM_NORMAL, get_cleanup_lock=get_cleanup_lock@entry=1 '\001', buf=buf@entry=0x7ffdf808a218)
> #12 0x000055c68700728b in heap_xlog_clean (record=0x55c688dd2378)
> #13 heap2_redo (record=0x55c688dd2378)
> #14 0x000055c68704a7eb in StartupXLOG ()
>
> Backend holding the buffer pin:
>
> #0 0x00007fd216660d27 in epoll_wait (epfd=5, events=0x55c688d67ca8, maxevents=maxevents@entry=1, timeout=timeout@entry=-1)
> #1 0x000055c687264be9 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffdf808e070, cur_timeout=-1, set=0x55c688d67c38)
> #2 WaitEventSetWait (set=0x55c688d67c38, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffdf808e070, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663297)
> #3 0x000055c687185d7e in secure_write (port=0x55c688db4a60, ptr=ptr@entry=0x55c688dd338e, len=len@entry=2666)
> #4 0x000055c687190f4c in internal_flush ()
> #5 0x000055c68719118a in internal_putbytes (s=0x55c68f8dcc35 "", s@entry=0x55c68f8dcc08 "", len=65)
> #6 0x000055c687191262 in socket_putmessage (msgtype=<optimized out>, s=0x55c68f8dcc08 "", len=<optimized out>)
> #7 0x000055c687193431 in pq_endmessage (buf=buf@entry=0x7ffdf808e1a0)
> #8 0x000055c686fd1442 in printtup (slot=0x55c6894c2dc0, self=0x55c689326b40)
> #9 0x000055c687151962 in ExecutePlan (execute_once=<optimized out>, dest=0x55c689326b40, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x55c689281a28, estate=0x55c6892816f8)
> #10 standard_ExecutorRun (queryDesc=0x55c689289628, direction=<optimized out>, count=0, execute_once=<optimized out>)
> #11 0x00007fd2074100c5 in pgss_ExecutorRun (queryDesc=0x55c689289628, direction=ForwardScanDirection, count=0, execute_once=<optimized out>)
> #12 0x000055c68728a356 in PortalRunSelect (portal=portal@entry=0x55c688d858b8, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x55c689326b40)
> #13 0x000055c68728b988 in PortalRun (portal=portal@entry=0x55c688d858b8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x55c689326b40, altdest=altdest@entry=0x55c689326b40, completionTag=0x7ffdf808e580 "")
> #14 0x000055c687287425 in exec_simple_query (query_string=0x55c688ec5e38 "select\n '/capacity/created-at-counter-by-time-v2' as sensor,\n round(extract(epoch from shipment_date))
> #15 0x000055c687289418 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x55c688dd3e28, dbname=<optimized out>, username=<optimized out>)
>
>
> I think the problem here is that secure_write() uses an infinite timeout.
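
For illustration, here is a standalone C sketch of the wait pattern visible in frames #0-#3 of the backend backtrace above (this is not PostgreSQL's secure_write(), just the general shape): when the client stops reading, the socket send buffer fills, the write fails with EAGAIN, and the server has to wait for the socket to become writable again. Waiting with an infinite timeout, as frame #2 shows (timeout=-1), never returns while the client sits idle; a bounded wait would let the loop come back periodically and notice pending interrupts. The interrupt_pending() hook below is a hypothetical placeholder for that check.

#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

static int interrupt_pending(void) { return 0; }    /* hypothetical placeholder */

static ssize_t
write_with_bounded_wait(int sock, const char *buf, size_t len)
{
    size_t  sent = 0;

    while (sent < len)
    {
        ssize_t n = write(sock, buf + sent, len - sent);

        if (n > 0)
        {
            sent += (size_t) n;
            continue;
        }
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                      /* hard error */

        /*
         * Send buffer is full.  Wait until the socket becomes writable,
         * but only for a bounded interval; poll(..., -1) here would
         * reproduce the unbounded wait seen in the report.
         */
        struct pollfd pfd = { .fd = sock, .events = POLLOUT };

        (void) poll(&pfd, 1, 1000);         /* wake up at least once per second */

        if (interrupt_pending())
        {
            errno = EINTR;
            return -1;                      /* let the caller react */
        }
    }
    return (ssize_t) sent;
}

The sketch is only meant to show why a backend blocked in an unbounded wait can keep holding its buffer pin for as long as the client stays idle, which matches the report; what the right fix in the backend would be (a wait timeout, terminating the conflicting session, or something else) is a separate question.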

Is this the same issue as the one reported at [1]?

[1] https://www.postgresql.org/message-id/adce2c09-3bfc-4666-997a-c21991cb1eb1.mengjuan.cmj@alibaba-inc.com

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
