RE: SIGQUIT on archiver child processes maybe not such a hot idea?

From: "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>
To: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: SIGQUIT on archiver child processes maybe not such a hot idea?
Date: 2019-09-02 00:27:09
Message-ID: 0A3221C70F24FB45833433255569204D1FD0B676@G01JPEXMBYT05
Lists: pgsql-hackers

From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> After investigation, the mechanism that's causing that is that the
> src/test/recovery/t/010_logical_decoding_timelines.pl test shuts
> down its replica server with a mode-immediate stop, which causes
> that postmaster to shut down all its children with SIGQUIT, and
> in particular that signal propagates to a "cp" command that the
> archiver process is executing. The "cp" is unsurprisingly running
> with default SIGQUIT handling, which per the signal man page
> includes dumping core.

We experienced this (a core dump left in the data directory by an archive command) years ago. Relatedly, the example that uses cp in the PostgreSQL manual is misleading, because cp does not sync the copied file to disk and so does not reliably persist the WAL archive file.

> This makes me wonder whether we shouldn't be using some other signal
> to shut down archiver subprocesses. It's not real cool if we're
> spewing cores all over the place. Admittedly, production servers
> are likely running with "ulimit -c 0" on most modern platforms,
> so this might not be a huge problem in the field; but accumulation
> of core files could be a problem anywhere that's configured to allow
> server core dumps.

We enable core dumps in production so that we can investigate problems if they occur.

> Ideally, perhaps, we'd be using SIGINT not SIGQUIT to shut down
> non-Postgres child processes. But redesigning the system's signal
> handling to make that possible seems like a bit of a mess.
>
> Thoughts?

We use a shell script, which in turn calls a command. That is:

archive_command = 'call some_shell_script.sh ...'

[some_shell_script.sh]
ulimit -c 0
trap SIGQUIT to just exit on the receipt of the signal
call some_command to copy file

some_command also catches SIGQUIT and just exits. It copies and syncs the file.
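
For illustration only, a minimal sketch of a wrapper along the lines described above might look like the following. It is not the actual script: the name archive_wal.sh, the two-argument interface, the temporary-file naming, and the use of cp plus sync in place of some_command are all assumptions. It also assumes a platform where sync waits for the writes to complete, as on Linux.

```shell
#!/bin/sh
# Hypothetical archive wrapper (names and layout are assumptions).
# Assumed use: archive_command = '/path/to/archive_wal.sh %p /archive/dir/%f'

ulimit -c 0          # inherited by children, so a signaled cp leaves no core file
trap 'exit 1' QUIT   # the shell itself exits on SIGQUIT instead of dumping core

archive_wal() {
    src=$1
    dst=$2
    tmp="$dst.tmp.$$"

    # Refuse to overwrite a file that is already archived.
    [ -e "$dst" ] && return 1

    # Copy under a temporary name and flush it before exposing the final name.
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    sync
    mv "$tmp" "$dst" || { rm -f "$tmp"; return 1; }
    sync             # coarse: flushes everything, including the rename
}

# Run only when invoked with a source and a destination path.
if [ "$#" -eq 2 ]; then
    archive_wal "$1" "$2"
fi
```

A non-zero exit status tells the archiver that the attempt failed, so the segment is retried later. A plain sync is heavier than syncing a single file, but it avoids depending on options that not every platform's tools provide.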

I proposed something along these lines in the thread below, but I couldn't respond to Peter's review comments because of other tasks. Does anyone think it's worth resuming?

https://www.postgresql.org/message-id/7E37040CF3804EA5B018D7A022822984@maumau

Regards
Takayuki Tsunakawa
