Re: Occasional 9.6.10 PMChildFlags fatal error, possibly due to >2 parallel gathers

From: Chris Snook <csnook(at)cloudflare(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Occasional 9.6.10 PMChildFlags fatal error, possibly due to >2 parallel gathers
Date: 2019-02-13 06:30:10
Message-ID: CAONUJSNGCLW1GSXh18raY6bvqBiDkqfLyKpy-QeSvZTx3SrHhA@mail.gmail.com
Lists: pgsql-bugs

There was an idle psql session running in screen, invoked as sudo -u
postgres psql. Salt also routinely runs a batch of configuration
assertion checks as the postgres user, but those are not login sessions
either, and they have been running sub-hourly for over a year without
incident. Backups run from a replica, whereas this failure happened on the
primary and not close to a backup run. Because we're using stock Debian
Stretch packages, the postgres user is a system user (UID 110, GID 114),
so the systemd RemoveIPC behavior wouldn't apply in this case.
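
For anyone who wants to double-check the system-user claim on their own
box, here is a rough sketch. It assumes the cutoff is the SYS_UID_MAX value
from /etc/login.defs, falling back to 999; systemd's real threshold is
fixed at build time, so treat this as an approximation. The function names
are made up for illustration.

```python
import re


def parse_sys_uid_max(login_defs_text, default=999):
    """Return SYS_UID_MAX from login.defs-style text, or `default` if unset.

    Commented-out lines (starting with '#') are ignored because the
    pattern only matches lines that begin with the key itself.
    """
    for line in login_defs_text.splitlines():
        m = re.match(r"^\s*SYS_UID_MAX\s+(\d+)", line)
        if m:
            return int(m.group(1))
    return default


def is_system_uid(uid, sys_uid_max):
    """Assumption: a UID at or below SYS_UID_MAX counts as a system user."""
    return uid <= sys_uid_max
```

On a live host the UID can come from pwd.getpwnam("postgres").pw_uid and
the text from reading /etc/login.defs. With UID 110 and a limit of 999 this
reports a system user.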

If we can figure out how to reproduce it reliably outside of production,
I'll turn all the logging options up to 11 so we can figure out whether
the shared memory error immediately follows the fatal error in the same
process, or is just a cleanup race as everything is shutting down. We
haven't had a recurrence with max_parallel_workers_per_gather set to 2.
However, after the two failures, which were 63 minutes apart, we also went
several hours with it still set to 10 without a recurrence, so that
doesn't mean much.
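
Once the PID is in every log line, the same-process question could be
answered mechanically. The sketch below assumes log_line_prefix is set to
'%m [%p] ' (not the 9.6 default, which is empty), so each line looks like
"timestamp [pid] SEVERITY:  message". The function name and the two-bucket
split are my own choices for illustration, not anything PostgreSQL ships.

```python
import re

# Assumed line shape for log_line_prefix = '%m [%p] ':
#   2019-02-12 20:00:00.000 UTC [12345] FATAL:  message text
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+ \S+) \[(?P<pid>\d+)\] (?P<sev>[A-Z]+):\s+(?P<msg>.*)$"
)
SHM_WARNING = "could not remove shared memory segment"


def classify_shm_warnings(lines):
    """Split shared-memory warnings by whether the same PID logged a FATAL first.

    Returns (same_process, no_prior_fatal):
      same_process   -> list of (pid, earlier_fatal_msg, warning_msg)
      no_prior_fatal -> list of (pid, warning_msg)
    Lines that do not match the assumed prefix are skipped.
    """
    last_fatal = {}  # pid -> most recent FATAL message seen for that pid
    same_process, no_prior_fatal = [], []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        pid, sev, msg = int(m["pid"]), m["sev"], m["msg"]
        if sev == "FATAL":
            last_fatal[pid] = msg
        elif sev == "WARNING" and msg.startswith(SHM_WARNING):
            if pid in last_fatal:
                same_process.append((pid, last_fatal[pid], msg))
            else:
                no_prior_fatal.append((pid, msg))
    return same_process, no_prior_fatal
```

If most warnings land in the first bucket, that points at the failing
process itself; if they land in the second, a shutdown cleanup race looks
more likely.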

- Chris

On Tue, Feb 12, 2019 at 9:55 PM Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> On Wed, Feb 13, 2019 at 3:41 PM Chris Snook <csnook(at)cloudflare(dot)com> wrote:
> > For more context, I got these tightly packed around the first crash,
> > with the first and last messages repeated hundreds of times:
> >
> > FATAL: sorry, too many clients already
> > FATAL: sorry, too many clients already
> > FATAL: sorry, too many clients already
> > FATAL: no free slots in PMChildFlags array
> > WARNING: could not remove shared memory segment "/PostgreSQL.1407760088": No such file or directory
> > FATAL: semop(id=2293786) failed: Invalid argument
> > FATAL: semop(id=2293786) failed: Invalid argument
> > FATAL: semctl(2064403, 7, SETVAL, 0) failed: Invalid argument
> > FATAL: semop(id=2621476) failed: Invalid argument
> > FATAL: semop(id=2621476) failed: Invalid argument
> > FATAL: semctl(2293786, 1, SETVAL, 0) failed: Invalid argument
> > FATAL: semctl(2621476, 10, SETVAL, 0) failed: Invalid argument
> > WARNING: could not remove shared memory segment "/PostgreSQL.1621779631": No such file or directory
>
> Any chance you created a cronjob that runs as user "postgres" (or
> whatever user the PostgreSQL cluster runs as), or logged in as that
> user manually for some reason? Systemd likes to blow away global IPC
> resources associated with users when they log out.
>
> https://www.postgresql.org/docs/11/kernel-resources.html#SYSTEMD-REMOVEIPC
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
