ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

From: ADSJ (Adam Sjøgren) <adsj(at)novozymes(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Date: 2017-06-07 13:33:26
Message-ID: 7pefuv53dl.fsf@novozymes.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Our database has started reporting errors like this:

2017-05-31 13:48:10 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 14242189 in pg_toast_10919630
...
2017-06-01 11:06:56 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 19573520 in pg_toast_10919630

(157 times, for different toast values, same pg_toast_nnn). pg_toast_10919630
corresponds to a table with around 168 million rows.

These went away, but the next day we got similar errors from another
table:

2017-06-02 05:59:50 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 47060150 in pg_toast_10920100
...
2017-06-02 06:14:54 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 47226455 in pg_toast_10920100

(Only 4 this time) pg_toast_10920100 corresponds to a table with holds
around 320 million rows (these are our two large tables).

The next day we got 6 such errors and the day after 10 such errors. On
June 5th we got 94, yesterday we got 111, of which one looked a little
different:

2017-06-06 17:32:21 CEST ERROR: unexpected chunk size 1996 (expected 1585) in final chunk 0 for toast value 114925100 in pg_toast_10920100

and today the logs have 65 lines, ending with these:

2017-06-07 14:49:53 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 131114834 in pg_toast_10920100
2017-06-07 14:53:41 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 131149566 in pg_toast_10920100

The database is 10 TB on disk (SSDs) and runs on a 48 core server with 3
TB RAM on Ubuntu 14.04 (Linux 3.18.13).

We are updating rows in the database a lot/continuously.

There are no apparent indications of hardware errors (like ECC) in
dmesg, nor any error messages logged by the LSI MegaRAID controller, as
far as I can tell.

We are running PostgreSQL 9.3.14 currently.

The only thing I could see in the release notes since 9.3.14 that might
be related is this:

"* Avoid very-low-probability data corruption due to testing tuple
visibility without holding buffer lock (Thomas Munro, Peter Geoghegan,
Tom Lane)"

Although reading more about it, it doesn't sound like it would exhibit
the symptoms we see?

We have recently increased the load (to around twice the number of
cores), though, which made me think we could be triggering corner cases
we haven't hit before.

We will be upgrading to PostgreSQL 9.3.17 during the weekend, but I'd like to hear
if anyone has seen something like this, or have some ideas of how to
investigate/what the cause might be.

Best regards,

Adam

--
"Lägg ditt liv i min hand Adam Sjøgren
Sälj din själ till ett band" adsj(at)novozymes(dot)com

Responses

Browse pgsql-general by date

  From Date Subject
Next Message David Rosenstrauch 2017-06-07 14:16:20 Advisory lock deadlock issue
Previous Message Jan de Visser 2017-06-07 12:24:49 Re: Extract from text id other table - PG 9.1