BUG #13818: PostgreSQL crashes after cronjob runs as "postgres"

From: ryan(at)runscope(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #13818: PostgreSQL crashes after cronjob runs as "postgres"
Date: 2015-12-14 22:47:29
Message-ID: 20151214224729.2624.99840@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 13818
Logged by: Ryan Park
Email address: ryan(at)runscope(dot)com
PostgreSQL version: 9.4.5
Operating system: CentOS 7.2
Description:

We are running PostgreSQL 9.4.5 on CentOS 7, uising the RPM from
postgresql.org. We perform automatic OS updates, so our server was just
upgraded from CentOS 7.1 to CentOS 7.2. Since the upgrade we've been seeing
PostgreSQL crash after a cronjob is run as the "postgres" user.

More details: we have a cronjob that runs every 3 minutes. This cronjob runs
as the OS user "postgres". It makes a very simple database update using
PSQL, like this:
*/3 * * * * /usr/pgsql-9.4/bin/psql -U <username> <dbname> -c 'UPDATE
patches SET name = name WHERE patch = 1'

Here's the syslog output from /var/log/messages when the cronjob runs:

Dec 14 18:24:01 prod118 systemd: Created slice user-1012.slice.
Dec 14 18:24:01 prod118 systemd: Starting user-1012.slice.
Dec 14 18:24:01 prod118 systemd: Started Session 15015 of user
postgres.
Dec 14 18:24:01 prod118 systemd: Starting Session 15015 of user
postgres.
Dec 14 18:24:01 prod118 systemd: Removed slice user-1012.slice.
Dec 14 18:24:01 prod118 systemd: Stopping user-1012.slice.
Dec 14 18:24:01 prod118 postgres[52611]: [2584-1] PANIC:
semop(id=57311261) failed: Invalid argument
Dec 14 18:24:01 prod118 postgres[52611]: [2584-2] STATEMENT: COMMIT
Dec 14 18:24:01 prod118 postgres[52584]: [2584-1] PANIC:
semop(id=57376799) failed: Invalid argument
Dec 14 18:24:01 prod118 postgres[52584]: [2584-2] STATEMENT: COMMIT
Dec 14 18:24:01 prod118 postgres[52630]: [2584-1] PANIC:
semop(id=57311261) failed: Invalid argument
Dec 14 18:24:01 prod118 postgres[52630]: [2584-2] STATEMENT: COMMIT
Dec 14 18:24:01 prod118 postgres[72998]: [2600-1] LOG: server process
(PID 52630) was terminated by signal 6: Aborted
Dec 14 18:24:01 prod118 postgres[72998]: [2600-2] DETAIL: Failed
process was running: COMMIT
Dec 14 18:24:01 prod118 postgres[72998]: [2601-1] LOG: terminating any
other active server processes
Dec 14 18:24:01 prod118 postgres[52711]: [2585-1] WARNING: terminating
connection because of crash of another server process
Dec 14 18:24:01 prod118 postgres[52711]: [2585-2] DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
Dec 14 18:24:01 prod118 postgres[52711]: [2585-3] HINT: In a moment you
should be able to reconnect to the database and repeat your command.
Dec 14 18:24:01 prod118 postgres[52715]: [2585-1] WARNING: terminating
connection because of crash of another server process
Dec 14 18:24:01 prod118 postgres[52714]: [2585-1] WARNING: terminating
connection because of crash of another server process
Dec 14 18:24:01 prod118 postgres[52715]: [2585-2] DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
Dec 14 18:24:01 prod118 postgres[52714]: [2585-2] DETAIL: The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.
Dec 14 18:24:01 prod118 postgres[52715]: [2585-3] HINT: In a moment you
should be able to reconnect to the database and repeat your command.
Dec 14 18:24:01 prod118 postgres[52714]: [2585-3] HINT: In a moment you
should be able to reconnect to the database and repeat your command.
... and so on.

It looks like systemd created a "slice" for the cronjob. A "slice" is a
partition of system resources using cgroups, kinda like a Docker container.
When the cronjob finished, systemd tore down the slice. My hunch is that
when the slice was stopped, it introduced some corruption in shared memory
or another shared resource. After the slice was stopped, we immediately saw
3 semop failures in PostgreSQL, which caused the entire Postgres server to
crash. The initial postgres process (/usr/pgsql-9.4/bin/postgres) kept
running, but all the subprocesses were restarted -- including the
checkpointer, writer, etc. This happened over and over, every time the
cronjob ran.

We tried restarting PostgreSQL and restarting the whole server, but neither
of those helped.

As I mentioned, our cronjob ran as the "postgres" user. When I switched the
cronjob to run from a different user account, everything worked fine and
Postgres stopped crashing. I'm guessing that there are fewer shared
resources between 2 different users, so nothing gets corrupt when the other
user's slice is stopped.

Exact sequence of steps from startup:
1. Create a database called "<dbname>" with a "patches" table. This is
the schema:
CREATE TABLE IF NOT EXISTS patches(
patch integer NOT NULL,
name character varying(50),
successful boolean NOT NULL,
applied_by character varying(1000),
applied_at timestamp without time zone NOT NULL,
CONSTRAINT patches_pkey PRIMARY KEY (patch)
) WITH (OIDS=FALSE);
2. Make sure the user "<username>" has permission to write to that
table.
3. Create a cronjob like the one described above. It should be in the
"postgres" user's crontab.
*/3 * * * * /usr/pgsql-9.4/bin/psql -U <username> <dbname> -c
'UPDATE patches SET name = name WHERE patch = 1'

What we got: PostgreSQL crashed with the logs listed above.

What we expected: PostgreSQL would not crash.

PostgreSQL version: Both server and client are version 9.4.5. We are using
these RPMs:
postgresql94-9.4.5-1PGDG.rhel7.x86_64
postgresql94-contrib-9.4.5-1PGDG.rhel7.x86_64
postgresql94-server-9.4.5-1PGDG.rhel7.x86_64
postgresql94-libs-9.4.5-1PGDG.rhel7.x86_64

Platform information: CentOS 7.2 on x86_64. The issue happened on both Linux
kernel 3.10.0-229.20.1.el7.x86_64 and on kernel 3.10.0-327.3.1.el7.x86_64.
The version of systemd is 219. This is running on Amazon EC2, on an
i2.4xlarge instance.

Thank you for your help.

Ryan Park
Runscope Operations
ryan(at)runscope(dot)com

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2015-12-14 23:11:15 Re: pg_rewind exiting with error code 1 when source and target are on the same timeline
Previous Message Heikki Linnakangas 2015-12-14 17:11:43 Re: PQexec() hangs on OOM