Re: Parallel worker hangs while handling errors.

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Parallel worker hangs while handling errors.
Date: 2020-07-25 01:32:32
Message-ID: CALDaNm04E=fbixNSqEeb4PWGdeYY26OZ-4bXNayKR2QoV6i=Yg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jul 24, 2020 at 12:41 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > Thanks for reviewing and adding your thoughts, My comments are inline.
> >
> > On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
> > <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
> > >
> > > The same hang issue can occur(though I'm not able to back it up with a
> > > use case), in the cases from wherever the EmitErrorReport() is called
> > > from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
> > > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> > > postgres.c.
> > >
> >
> > I'm not sure if this can occur in other cases.
> >
>
> I checked further on this point: Yes, it can't occur for the other
> cases, as mq_putmessage() gets only used for parallel
> workers(ParallelWorkerMain() --> pq_redirect_to_shm_mq()).
>
> >
> > > Note that, in all sigsetjmp blocks, we intentionally
> > > HOLD_INTERRUPTS(), to not cause any issues while performing error
> > > handling, I'm concerned here that now, if we directly call
> > > BackgroundWorkerUnblockSignals() which will open up all the signals
> > > and our main intention of holding interrupts or block signals may go
> > > away.
> > >
> > > Since the main problem for this hang issue is because of blocking
> > > SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
> > > instead of unblocking all signals? I tried this with parallel copy
> > > hang, the issue is resolved.
> > >
> >
> > On putting further thoughts on this, I feel just unblocking SIGUSR1
> > would be the right approach in this case. I'm attaching a new patch
> > which unblocks SIGUSR1 signal. I have verified that the original issue
> > with WIP parallel copy patch gets fixed. I have made changes only in
> > bgworker.c as we require the parallel worker to receive this signal
> > and continue processing. I have not included the changes for other
> > processes as I'm not sure if this scenario is applicable for other
> > processes.
> >
>
> Since we are sure that this hang issue can occur only for parallel
> workers, and the change in StartBackgroundWorker's sigsetjmp's block
> should only be made for parallel worker cases. And also there can be a
> lot of other callbacks execution and processing in proc_exit() for
> which we might not need the SIGUSR1 unblocked. So, let's undo the
> unblocking right after EmitErrorReport() to not cause any new issues.
>
> Attaching a modified v2 patch: it has the unblocking for only for
> parallel workers, undoing it after EmitErrorReport(), and some
> adjustments in the comment.
>

I have made slight changes on top of the patch to remove duplicate
code, attached v3 patch for the same.
The parallel worker hang issue gets resolved, make check & make
check-world passes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
v3-0001-Fix-for-Parallel-worker-hangs-while-handling-erro.patch text/x-patch 2.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2020-07-25 01:52:15 Re: handle a ECPG_bytea typo
Previous Message vignesh C 2020-07-25 00:41:33 Re: Include access method in listTables output