Re: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)
Date: 2018-10-10 03:59:29
Message-ID: CAEepm=10MtmKeDc1WxBM0PQM9OgtNy+RCeWqz40pZRRS3PNo5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 10, 2018 at 7:29 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Oct 4, 2018 at 7:37 PM Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > 0. The default SIGTERM handler for bgworkers is bgworker_die(), which
> > generates a FATAL ereport "terminating background worker \"%s\" due to
> > administrator command", directly in the signal handler (hmm, see also
> > recent discussions about the legality of similar code in quickdie()).
>
> Yeah, I think that code is bad news, for the same reasons discussed
> with regard to quickdie().

So I suppose we should just remove it, with something like 0002. I'm
a bit uneasy about existing code out there that might be not calling
CFI. OTOH I suspect that a lot of code copied worker_spi.c and
installed its own handler.

> > 1. Some processes install their own custom SIGTERM handler that sets
> > a flag, and use that to break out of their main loop and exit quietly.
> > Example: autovacuum.c, or the open-source pg_cron extension, and
> > probably many other things that just want a nice quiet shutdown.
> >
> > 2. Some processes install the standard SIGTERM handler die(), and
> > then use CHECK_FOR_INTERRUPTS() to break out of their main loop. By
> > default this looks like "FATAL: terminating connection due to
> > administrator command". This approach is effectively required if the
> > main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
> > loop. For example, a "launcher"-type process that calls
> > WaitForBackgroundWorkerStartup() could hang forever if it used
> > approach 1 (ask me how I know).
>
> My experience with background workers has been that approach #1 does
> not really work. I mean, if you aren't doing anything complicated it
> might be OK, if for example you are executing SQL queries, and you try
> to do #1, then your SQL queries don't respond to interrupts. And that
> sucks. So I have generally adopted approach #2.

Maybe src/test/modules/worker_spi.c shouldn't use approach #1 (even if
it might technically be OK for that code)? I think I might have been
guilty of copying that.

> To put that another way, nearly everything can reach
> CHECK_FOR_INTERRUPTS(), so saying that this is required whenever that
> in the case isn't much different from saying that it is required, full
> stop.

> > 3. Some system processes generally use approach 2, but have a special
> > case in ProcessInterrupts() to suppress or alter the usual FATAL
> > message or behaviour. This includes the logical replication launcher.
>
> Maybe we could replace this by a general-purpose hook. So instead of
> the various tests for process types that are there now, we would just
> have
>
> if (procdie_hook != NULL)
> (*procdie_hook)();
>
> And that hook can do whatever you like (but probably including dying).

Ok, patch 0001 is a sketch like that, for discussion.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
0001-Add-proc_die_hook-to-customize-die-interrupt-handlin.patch application/octet-stream 3.1 KB
0002-Remove-async-signal-unsafe-bgworker_die-function.patch application/octet-stream 1.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-10-10 04:01:50 Re: partition tree inspection functions
Previous Message Amit Langote 2018-10-10 02:54:48 Re: partition tree inspection functions