Re: Segfault leading to crash, recovery mode, and TOAST corruption

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jonathan Marks <jonathanaverymarks(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Segfault leading to crash, recovery mode, and TOAST corruption
Date: 2018-06-06 00:07:31
Message-ID: 25981.1528243651@sss.pgh.pa.us
Lists: pgsql-general

Jonathan Marks <jonathanaverymarks(at)gmail(dot)com> writes:
> We had two issues today (once this morning and once a few minutes ago)
> with our primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB
> total disk space, 20k PIOPS) where the database suddenly crashed and
> went into recovery mode.

I'd suggest updating to 10.4 ... see below.

> Both times that the server crashed, we saw this in the logs:
> 2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR: canceling statement due to statement timeout
> 2018-06-05 23:08:44 UTC::@:[48863]:LOG: worker process: parallel worker for PID 12173 (PID 20238) exited with exit code 1
> 2018-06-05 23:08:49 UTC::@:[48863]:LOG: server process (PID 12173) was terminated by signal 11: Segmentation fault

This looks to be a parallel leader process getting confused when a worker
process exits unexpectedly. There were some related fixes in 10.2, which
might resolve the issue, though it's also possible we have more to do there.
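
To check whether other crashes in the log follow the same pattern, one could pair the two message types by leader PID. The following is only a rough sketch in Python: the function name and regular expressions are illustrative assumptions based on the wording of the log lines quoted above, and other log_line_prefix settings or message variants are not handled.

```python
import re

# Patterns follow the wording of the two LOG lines quoted above (assumption:
# the message text is the same for every occurrence in the log).
WORKER_EXIT = re.compile(
    r"parallel worker for PID (\d+) \(PID (\d+)\) exited with exit code (\d+)"
)
SEGFAULT = re.compile(
    r"server process \(PID (\d+)\) was terminated by signal 11"
)


def crashed_parallel_leaders(lines):
    """Return leader PIDs that lost a worker and later died with signal 11."""
    leaders_with_failed_worker = set()
    crashed = []
    for line in lines:
        m = WORKER_EXIT.search(line)
        if m and m.group(3) != "0":
            # Remember the leader whose worker exited with a nonzero code.
            leaders_with_failed_worker.add(int(m.group(1)))
            continue
        m = SEGFAULT.search(line)
        if m and int(m.group(1)) in leaders_with_failed_worker:
            crashed.append(int(m.group(1)))
    return crashed


log = [
    "2018-06-05 23:08:44 UTC::@:[48863]:LOG: worker process: parallel worker "
    "for PID 12173 (PID 20238) exited with exit code 1",
    "2018-06-05 23:08:49 UTC::@:[48863]:LOG: server process (PID 12173) was "
    "terminated by signal 11: Segmentation fault",
]
print(crashed_parallel_leaders(log))  # [12173]
```

A PID in the result is consistent with the leader-confusion theory; it does not prove it.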

> After the first crash, we then started getting errors like:
> 2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR: unexpected chunk number 0 (expected 1) for toast value 1592283014 in pg_toast_26656

This definitely looks to be the "reuse of TOAST OIDs immediately after
crash" issue that was fixed in 10.4. AFAIK it's recoverable corruption;
I believe you'll find that VACUUMing the parent table will make the
errors stop, and all will be well. But an update would be prudent to
prevent it from happening again.
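
To find which parent table to VACUUM, the TOAST relation named in the error can be looked up through pg_class.reltoastrelid. A minimal sketch in Python that only builds the lookup query from the error text; the helper name parent_lookup_sql is hypothetical, and the query still has to be run by hand (for example in psql), followed by a VACUUM of whatever table it returns.

```python
import re

# Matches the tail of the error quoted above:
#   "... for toast value 1592283014 in pg_toast_26656"
TOAST_ERROR = re.compile(r"for toast value (\d+) in (pg_toast_\d+)")


def parent_lookup_sql(error_message):
    """Build a catalog query that returns the table owning the TOAST relation.

    Returns None when the message does not name a TOAST relation.
    """
    m = TOAST_ERROR.search(error_message)
    if m is None:
        return None
    toast_rel = m.group(2)  # restricted by the regex to pg_toast_<digits>
    return (
        "SELECT oid::regclass AS parent_table "
        "FROM pg_class "
        f"WHERE reltoastrelid = 'pg_toast.{toast_rel}'::regclass;"
    )


msg = ("unexpected chunk number 0 (expected 1) "
       "for toast value 1592283014 in pg_toast_26656")
print(parent_lookup_sql(msg))
```

The lookup goes through reltoastrelid rather than trusting the number in the TOAST relation's name, so it does not depend on the naming convention.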

regards, tom lane
