Re: Segfault leading to crash, recovery mode, and TOAST corruption

From: Jonathan Marks <jonathanaverymarks(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Segfault leading to crash, recovery mode, and TOAST corruption
Date: 2018-06-06 00:36:00
Message-ID: CF9FED80-6E1A-47F0-969B-B3E4757BFC2B@gmail.com
Lists: pgsql-general

Thank you so very much, Tom.

Vacuuming fixed the TOAST corruption issue, and we’ll upgrade our instances tonight (the latest version RDS offers is 10.3, but that’s a start).

> On Jun 5, 2018, at 8:07 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Jonathan Marks <jonathanaverymarks(at)gmail(dot)com> writes:
>> We had two issues today (once this morning and once a few minutes ago)
>> with our primary database (RDS running 10.1, 32 cores, 240 GB RAM, 5TB
>> total disk space, 20k PIOPS) where the database suddenly crashed and
>> went into recovery mode.
>
> I'd suggest updating to 10.4 ... see below.
>
>> Both times that the server crashed, we saw this in the logs:
>> 2018-06-05 23:08:44 UTC:172.31.7.89(36224):production@OURDB:[12173]:ERROR: canceling statement due to statement timeout
>> 2018-06-05 23:08:44 UTC::@:[48863]:LOG: worker process: parallel worker for PID 12173 (PID 20238) exited with exit code 1
>> 2018-06-05 23:08:49 UTC::@:[48863]:LOG: server process (PID 12173) was terminated by signal 11: Segmentation fault
>
> This looks to be a parallel leader process getting confused when a worker
> process exits unexpectedly. There were some related fixes in 10.2, which
> might resolve the issue, though it's also possible we have more to do there.
>
>> After the first crash, we then started getting errors like:
>> 2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:ERROR: unexpected chunk number 0 (expected 1) for toast value 1592283014 in pg_toast_26656
>
> This definitely looks to be the "reuse of TOAST OIDs immediately after
> crash" issue that was fixed in 10.4. AFAIK it's recoverable corruption;
> I believe you'll find that VACUUMing the parent table will make the
> errors stop, and all will be well. But an update would be prudent to
> prevent it from happening again.
>
> regards, tom lane
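
For anyone who hits the same error and needs to work out which parent table to VACUUM: the log line only names the TOAST relation (pg_toast_26656 above), not its owner. The following is a minimal sketch, not something from the thread itself. It pulls the TOAST relation name out of a log line and builds a catalog query against pg_class.reltoastrelid to find the owning table. The helper names (toast_relation, parent_lookup_sql) are hypothetical, and the regular expression assumes the message text matches the error quoted above.

```python
import re

# Matches the TOAST error text quoted in this thread (assumed message format).
TOAST_ERROR = re.compile(
    r"unexpected chunk number (?P<got>\d+) \(expected (?P<expected>\d+)\) "
    r"for toast value (?P<value>\d+) in (?P<toast_rel>pg_toast_\d+)"
)


def toast_relation(log_line):
    """Return the TOAST relation named in a log line, or None if there is none."""
    match = TOAST_ERROR.search(log_line)
    return match.group("toast_rel") if match else None


def parent_lookup_sql(toast_rel):
    """Build a catalog query that finds the table owning a TOAST relation."""
    # Only accept names of the form pg_toast_<digits> before interpolating.
    if not re.fullmatch(r"pg_toast_\d+", toast_rel):
        raise ValueError(f"not a TOAST relation name: {toast_rel!r}")
    return (
        "SELECT oid::regclass AS parent_table FROM pg_class "
        f"WHERE reltoastrelid = 'pg_toast.{toast_rel}'::regclass;"
    )


line = (
    "2018-06-05 23:08:45 UTC:172.31.6.84(33392):production@OURDB:[11888]:"
    "ERROR: unexpected chunk number 0 (expected 1) for toast value "
    "1592283014 in pg_toast_26656"
)
rel = toast_relation(line)
print(parent_lookup_sql(rel))
```

Running the printed query in psql should return the parent table, which is the one to VACUUM as suggested above.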
