BUG #17345: pg_basebackup stucked for 2 hours before timeout

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: bchen90(at)163(dot)com
Subject: BUG #17345: pg_basebackup stucked for 2 hours before timeout
Date: 2021-12-27 03:53:30
Message-ID: 17345-a66a0084532b7beb@postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 17345
Logged by: Bo Chen
Email address: bchen90(at)163(dot)com
PostgreSQL version: 11.13
Operating system: euleros v2r9 x86_64
Description:

Hello experts,
I am facing an issue with pg_basebackup in a Docker environment. When the
primary VM restarts while pg_basebackup is running on the standby (a Docker
container inside a VM), it takes 2 hours before pg_basebackup times out.
After analyzing and reproducing the problem, I think the cause is that the
parent process, which fetches the data files, is blocked waiting for TCP
keepalive, and it ignores or blocks SIGCHLD while it is inside poll(). So we
added code that signals the parent when the WAL-fetching child exits with a
non-zero status.
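
For context on the 2-hour figure: it is close to the Linux default TCP keepalive idle time (tcp_keepalive_time = 7200 seconds), so a client that only waits for data from a vanished peer may not notice until the keepalive probes fail. The sketch below is an illustration only, not part of the proposed patch. It assumes Linux socket options (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT), and the function names set_keepalive and keepalive_detection_seconds are hypothetical. libpq exposes comparable connection parameters (keepalives_idle, keepalives_interval, keepalives_count), which may be an alternative way to shorten the wait.

```c
#define _DEFAULT_SOURCE
#include <assert.h>				/* only needed for the usage check below */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: enable TCP keepalive on a socket and tune it.
 * The three TCP_KEEP* options are Linux-specific.
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int
set_keepalive(int sock, int idle_s, int intvl_s, int count)
{
	int			on = 1;

	if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
		return -1;
	/* seconds of idleness before the first probe is sent */
	if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) != 0)
		return -1;
	/* seconds between unanswered probes */
	if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) != 0)
		return -1;
	/* number of unanswered probes before the connection is dropped */
	if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0)
		return -1;
	return 0;
}

/*
 * Rough upper bound, in seconds, on how long an idle connection to a
 * silent peer survives: the idle time plus all failed probes.
 */
int
keepalive_detection_seconds(int idle_s, int intvl_s, int count)
{
	return idle_s + intvl_s * count;
}
```

With the usual Linux defaults (7200 s idle, 75 s interval, 9 probes), keepalive_detection_seconds(7200, 75, 9) gives 7875 seconds, a little over two hours. Calling set_keepalive(sock, 60, 10, 3) on a TCP socket would bring the estimate down to 90 seconds.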

Below is the modified code.
 #include "streamutil.h"
+#include <sys/prctl.h>

 #define ERRCODE_DATA_CORRUPTED "XX001"

@@ -565,6 +566,8 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
     uint32 hi,
            lo;
     char statusdir[MAXPGPATH];
+    pid_t bgpid;
+    int ret;

     param = pg_malloc0(sizeof(logstreamer_param));
     param->timeline = timeline;
@@ -662,12 +665,24 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
      * a fork(). On Windows, we create a thread.
      */
 #ifndef WIN32
+    bgpid = getpid();
+
     bgchild = fork();
     if (bgchild == 0)
     {
+        (void)prctl(PR_SET_PDEATHSIG, SIGQUIT);
         /* in child process */
-        exit(LogStreamerMain(param));
+        ret = LogStreamerMain(param);
+        if (ret != 0)
+        {
+            kill(bgpid, SIGINT);
+        }
+        exit(ret);
     }
     else if (bgchild < 0)
     {

This is the stack trace while pg_basebackup is stuck:
#0 0xf7f6e039 in __kernel_vsyscall ()
#1 0xf7a1f2ea in poll () from /usr/lib/libc.so.6
#2 0xf7b25ea0 in pqSocketPoll (sock=5, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1127
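
The idea behind the patch, a failing child sending a signal so that the parent's blocking poll() returns, can be tried in isolation. The following is a minimal standalone sketch and not pg_basebackup code: the function child_failure_wakes_parent and the handler on_sigint are hypothetical names, a pipe that never becomes readable stands in for the stalled connection, and the poll() uses a 5-second safety timeout instead of the infinite one (end_time=-1) seen in the backtrace, so the demo cannot hang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>				/* only needed for the usage check below */
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

static void
on_sigint(int signo)
{
	(void) signo;
	got_sigint = 1;
}

/*
 * Returns 1 if the parent's poll() was interrupted (EINTR) by SIGINT sent
 * from the failing child, 0 if it was not, and -1 on a setup error.
 */
int
child_failure_wakes_parent(void)
{
	int			pipefd[2];
	struct sigaction sa, old_sa;
	struct pollfd pfd;
	struct timespec delay = {0, 200 * 1000 * 1000};	/* 200 ms */
	pid_t		parent = getpid();
	pid_t		child;
	int			rc, saved_errno, status = 0;

	/* The parent keeps both ends open, so the read end never becomes ready. */
	if (pipe(pipefd) != 0)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigint;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGINT, &sa, &old_sa) != 0)
		return -1;
	got_sigint = 0;

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0)
	{
		/* Child: give the parent time to enter poll(), then "fail". */
		nanosleep(&delay, NULL);
		kill(parent, SIGINT);
		_exit(1);
	}

	/* Parent: block the way the backtrace shows, waiting for readable data. */
	pfd.fd = pipefd[0];
	pfd.events = POLLIN;
	pfd.revents = 0;
	rc = poll(&pfd, 1, 5000);
	saved_errno = errno;

	/* Clean up: reap the child, restore the old handler, close the pipe. */
	waitpid(child, &status, 0);
	sigaction(SIGINT, &old_sa, NULL);
	close(pipefd[0]);
	close(pipefd[1]);

	return (rc == -1 && saved_errno == EINTR && got_sigint) ? 1 : 0;
}
```

A caller can check the behavior with assert(child_failure_wakes_parent() == 1). The 200 ms delay in the child is only a simple way to make it likely that the parent is already inside poll() when the signal arrives; a signal delivered before poll() starts would not interrupt it.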

The same issue was reported earlier by Ninad Shah:
https://www.postgresql.org/message-id/CAOFEiBd9j620TsBZPT0%2BuvdemQqwTrCLohcLjuDfQ2ye-xdswQ%40mail.gmail.com

Regards,
Bo Chenbo
