Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Why does logical replication launcher exit with exit code 1?
Date: 2018-10-04 23:36:26
Message-ID: CAEepm=0MGgKXQ_-M5xFmQS5a70JCD5xFd31VCAEAUrnL8OQ2Qw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Aug 3, 2017 at 1:41 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Aug 2, 2017 at 9:33 PM, Peter Eisentraut
> <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> > On 8/2/17 16:52, Robert Haas wrote:
> >> I actually don't think it's that unreasonable to get notified when
> >> system-wide processes like the autovacuum launcher or the logical
> >> replication launcher start or stop.
> >
> > But we got rid of those start messages recently due to complaints.
>
> Yeah, true. I'm just talking about what *I* think. :-)

I still think the current situation is non-ideal. I don't have a
strong view on whether some or all system-wide processes should say
hello and goodbye explicitly in the log, but I do think we need a way
to make that not look like an error condition, and ideally without
special cases for known built-in processes.

I looked into this a bit today, while debugging an extension that runs
a background worker. Here are some assorted approaches to shutdown:

0. The default SIGTERM handler for bgworkers is bgworker_die(), which
generates a FATAL ereport "terminating background worker \"%s\" due to
administrator command", directly in the signal handler (hmm, see also
recent discussions about the legality of similar code in quickdie()).

1. Some processes install their own custom SIGTERM handler that sets
a flag, and use that to break out of their main loop and exit quietly.
Example: autovacuum.c, or the open-source pg_cron extension, and
probably many other things that just want a nice quiet shutdown.

2. Some processes install the standard SIGTERM handler die(), and
then use CHECK_FOR_INTERRUPTS() to break out of their main loop. By
default this looks like "FATAL: terminating connection due to
administrator command". This approach is effectively required if the
main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
loop. For example, a "launcher"-type process that calls
WaitForBackgroundWorkerStartup() could hang forever if it used
approach 1 (ask me how I know).

3. Some system processes generally use approach 2, but have a special
case in ProcessInterrupts() to suppress or alter the usual FATAL
message or behaviour. This includes the logical replication launcher.

A couple of thoughts:

* Extensions that need to use die() (or something else that would be
compatible with CFI() wait loops) should ideally have a way to control
how ProcessInterrupts() reports this to the user, since shutdown is an
expected condition for long-lived processes.

* We really should get rid of that "exited with exit code 1".

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2018-10-04 23:54:45 Re: Segfault when creating partition with a primary key and sql_drop trigger exists
Previous Message Michael Paquier 2018-10-04 23:29:29 Re: Segfault when creating partition with a primary key and sql_drop trigger exists