From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Sherrylyn Branchaw <sbranchaw(at)gmail(dot)com> |
Cc: | pg(at)bowt(dot)ie, andres(at)anarazel(dot)de, pgsql-general(at)postgresql(dot)org |
Subject: | Re: What to do when dynamic shared memory control segment is corrupt |
Date: | 2018-06-19 03:40:12 |
Message-ID: | 13596.1529379612@sss.pgh.pa.us |
Lists: | pgsql-general |
Sherrylyn Branchaw <sbranchaw(at)gmail(dot)com> writes:
>> Hm ... were these installations built with --enable-cassert? If not,
>> an abort trap seems pretty odd.
> The packages are installed directly from the yum repos for RHEL. I'm not
> aware that --enable-cassert is being used, and we're certainly not
> installing from source.
OK, I'm pretty sure nobody builds production RPMs with --enable-cassert.
But your extensions (as listed below) don't include any C++ code, so
that still leaves us wondering where the abort trap came from. A stack
trace would almost certainly help clear that up.
>> Those "incomplete data" messages are quite unexpected and disturbing.
> We're using the stock initd script from the yum repo, but I dug into this
> issue, and it looks like we're passing the path to the postmaster.pid as
> the $pidfile variable in our sysconfig file, meaning the initd script is
> managing the postmaster.pid file, and specifically is overwriting it with a
> single line containing just the pid. I'm not sure why it's set up like
> this, and I'm thinking we should change it, but it seems harmless and
> unrelated to the crash. In particular, manual initd actions such as stop,
> start, restart, and status all work fine.
This is bad; a normal postmaster.pid file contains half a dozen lines
besides the PID proper. You might get away with this for now, but it'll
break pg_ctl as of v10 or so, and might confuse other external tools
sooner than that. Still, it doesn't seem related to your crash problem.
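
As a rough illustration of the point about postmaster.pid, here is a minimal Python sketch that flags a pid file that appears to have been overwritten with just the PID. The function names and the line-count threshold are assumptions made for this example, not anything PostgreSQL ships; the only fact relied on is that a server-written file carries several more lines after the PID.

```python
from pathlib import Path

# Assumption for this sketch: a postmaster.pid written by the server itself
# has the PID on the first line followed by several more lines, so a file
# with a single line was probably overwritten by an external script.
MIN_EXPECTED_LINES = 2  # hypothetical threshold, deliberately conservative


def pidfile_looks_overwritten(text: str) -> bool:
    """Return True if the pid file content looks truncated or rewritten."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return True  # empty file: certainly not what the server wrote
    if not lines[0].strip().isdigit():
        return True  # first line should be the PID
    return len(lines) < MIN_EXPECTED_LINES


def check_pidfile(path: str) -> bool:
    """Read the file at `path` (hypothetical location) and apply the check."""
    return pidfile_looks_overwritten(Path(path).read_text())
```

Something like `check_pidfile("/path/to/data/postmaster.pid")` could be run from a monitoring job; the path shown is a placeholder.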
>> No, that looks like fairly typical crash recovery to me: corrupt shared
>> memory contents are expected and recovered from after a crash.
> That's reassuring. But if it's safe for us to immediately start the server
> back up, why did Postgres not automatically start the server up like it did
> the first time?
Yeah, I'd like to know that too. The complaint about corrupt shared
memory may be just an unrelated red herring, or it might be a separate
effect of whatever the primary failure was ... but I think it was likely
not the direct cause of the failure-to-restart. But we've got no real
evidence as to what that direct cause was.
> At any rate, if it's safe, we can write a script to detect this failure
> mode and automatically restart, although it would be less error-prone if
> Postgres restarted automatically.
I realize that you're most focused on less-downtime, but from my
perspective it'd be good to worry about collecting evidence as to
what happened exactly. Capturing core files is a good start --- and
don't forget the possibility that there's more than one. A plausible
guess as to why the system didn't restart is that the postmaster crashed
too, so we'd need to see its core to figure out why.
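
Core files are only written if the process's core size limit allows it. The sketch below, in Python, checks that limit for the current process using the standard `resource` module (Unix only). The helper names are made up for this example, and the result only reflects the postmaster's situation if the check runs in the same environment that launches the postmaster, since the limit is inherited from the parent process.

```python
import resource


def limit_allows_core(soft_limit: int) -> bool:
    """A soft limit of 0 suppresses core files; anything else permits them."""
    return soft_limit != 0


def core_dumps_enabled_here() -> bool:
    """Check the core-size soft limit of the current process.

    Assumption: this is run from the same environment (init script, shell)
    that starts the postmaster, so the limit matches what the server inherits.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return limit_allows_core(soft)
```

If this returns False, no core file will be produced by processes started from that environment until the limit is raised.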
Anyway, I would not be afraid to try restarting the postmaster manually
if it died. Maybe don't do that repeatedly without human intervention;
but PG is pretty robust against crashes. We developers crash it all the
time, and we don't lose data.
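
The advice about not restarting repeatedly without human intervention could be sketched as a small guard for a restart script. This is only an illustration in Python: the function name, the limit of one automatic restart, and the one-hour window are hypothetical choices, not recommendations from this thread.

```python
from typing import Sequence


def should_auto_restart(
    restart_times: Sequence[float],
    now: float,
    max_restarts: int = 1,           # hypothetical: one unattended restart
    window_seconds: float = 3600.0,  # hypothetical: per one-hour window
) -> bool:
    """Allow an automatic restart only if few restarts happened recently.

    `restart_times` holds timestamps (in seconds) of earlier automatic
    restarts. Once the limit is reached inside the window, the caller
    should stop and alert a human instead of restarting again.
    """
    recent = [t for t in restart_times if now - t < window_seconds]
    return len(recent) < max_restarts
```

A watchdog would call this before each restart attempt and record the timestamp whenever it does restart; when the function returns False, it should page someone rather than retry.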
regards, tom lane