Re: What to do when dynamic shared memory control segment is corrupt

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Sherrylyn Branchaw <sbranchaw(at)gmail(dot)com>
Cc:	pg(at)bowt(dot)ie, andres(at)anarazel(dot)de, pgsql-general(at)postgresql(dot)org
Subject:	Re: What to do when dynamic shared memory control segment is corrupt
Date:	2018-06-19 03:40:12
Message-ID:	13596.1529379612@sss.pgh.pa.us
Lists:	pgsql-general

Sherrylyn Branchaw <sbranchaw(at)gmail(dot)com> writes:
>> Hm ... were these installations built with --enable-cassert? If not,
>> an abort trap seems pretty odd.

> The packages are installed directly from the yum repos for RHEL. I'm not
> aware that --enable-cassert is being used, and we're certainly not
> installing from source.

OK, I'm pretty sure nobody builds production RPMs with --enable-cassert.
But your extensions (as listed below) don't include any C++ code, so
that still leaves us wondering where the abort trap came from. A stack
trace would almost certainly help clear that up.

>> Those "incomplete data" messages are quite unexpected and disturbing.

> We're using the stock initd script from the yum repo, but I dug into this
> issue, and it looks like we're passing the path to the postmaster.pid as
> the $pidfile variable in our sysconfig file, meaning the initd script is
> managing the postmaster.pid file, and specifically is overwriting it with a
> single line containing just the pid. I'm not sure why it's set up like
> this, and I'm thinking we should change it, but it seems harmless and
> unrelated to the crash. In particular, manual initd actions such as stop,
> start, restart, and status all work fine.

This is bad; a normal postmaster.pid file contains half a dozen lines
besides the PID proper. You might get away with this for now, but it'll
break pg_ctl as of v10 or so, and might confuse other external tools
sooner than that. Still, it doesn't seem related to your crash problem.
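
To illustrate the point about the overwritten file, here is a minimal Python sketch that flags a postmaster.pid reduced to a bare PID. The function name `inspect_pidfile` and the threshold `MIN_EXPECTED_LINES` are hypothetical; the threshold of seven is only an assumption derived from "half a dozen lines besides the PID proper", not a documented constant.

```python
from pathlib import Path

# Assumed lower bound for an intact file: the PID line plus roughly
# half a dozen further lines. Adjust for the server version in use.
MIN_EXPECTED_LINES = 7


def inspect_pidfile(path):
    """Return (pid, line_count, looks_truncated) for a postmaster.pid file."""
    lines = Path(path).read_text().splitlines()
    if not lines or not lines[0].strip().isdigit():
        raise ValueError(f"{path}: first line is not a PID")
    pid = int(lines[0].strip())
    # A file holding only the PID, as written by the init script described
    # above, has fewer lines than a file the postmaster wrote itself.
    return pid, len(lines), len(lines) < MIN_EXPECTED_LINES
```

Calling `inspect_pidfile` on the data directory's postmaster.pid and checking the third element of the result would show whether the init script has replaced the file's contents.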

>> No, that looks like fairly typical crash recovery to me: corrupt shared
>> memory contents are expected and recovered from after a crash.

> That's reassuring. But if it's safe for us to immediately start the server
> back up, why did Postgres not automatically start the server up like it did
> the first time?

Yeah, I'd like to know that too. The complaint about corrupt shared
memory may be just an unrelated red herring, or it might be a separate
effect of whatever the primary failure was ... but I think it was likely
not the direct cause of the failure-to-restart. But we've got no real
evidence as to what that direct cause was.

> At any rate, if it's safe, we can write a script to detect this failure
> mode and automatically restart, although it would be less error-prone if
> Postgres restarted automatically.

I realize that you're most focused on less-downtime, but from my
perspective it'd be good to worry about collecting evidence as to
what happened exactly. Capturing core files is a good start --- and
don't forget the possibility that there's more than one. A plausible
guess as to why the system didn't restart is that the postmaster crashed
too, so we'd need to see its core to figure out why.

Anyway, I would not be afraid to try restarting the postmaster manually
if it died. Maybe don't do that repeatedly without human intervention;
but PG is pretty robust against crashes. We developers crash it all the
time, and we don't lose data.
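
The advice above (restart manually, but not repeatedly without human intervention) could be sketched as a guard that allows one automatic restart and then stops. This is only an illustration under stated assumptions: `restart_once` and the marker file are hypothetical, `pg_ctl` is assumed to be on the PATH (the installation discussed here uses an init script instead), and the sketch relies on `pg_ctl status` exiting with 0 when the server is running. It does not collect core files, which should be saved before any restart.

```python
import subprocess
from pathlib import Path


def restart_once(data_dir, marker, run=subprocess.run):
    """Start a dead postmaster at most once; afterwards require a human.

    Returns "running", "restarted", or "needs-human".
    """
    status = run(["pg_ctl", "status", "-D", data_dir])
    if status.returncode == 0:
        return "running"
    marker = Path(marker)
    if marker.exists():
        # An earlier automatic restart has not been reviewed yet, so
        # leave the server down until someone removes the marker.
        return "needs-human"
    marker.write_text("automatic restart attempted\n")
    run(["pg_ctl", "start", "-D", data_dir])
    return "restarted"
```

An operator who has looked at the logs and core files would delete the marker file by hand to re-arm the guard.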

regards, tom lane

In response to

Re: What to do when dynamic shared memory control segment is corrupt at 2018-06-18 23:50:04 from Sherrylyn Branchaw

Responses

Re: What to do when dynamic shared memory control segment is corrupt at 2018-06-19 15:43:25 from Sherrylyn Branchaw
