Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Date: 2017-06-07 15:01:49
Message-ID: 8c975456-a717-2321-a135-f8629fe78602@matrix.gatewaynet.com
Lists: pgsql-general

On 07/06/2017 17:49, Harry Ambrose wrote:
> Hi,
> Out of interest, are you using any tablespaces other than pg_default? I can only replicate the issue when using separately mounted tablespaces.
One lesson I learned from the BSD camp when dealing with random freezes and panics: when all else fails to give an answer, it is time to start blaming the hardware. Are those tablespaces on cheap SSDs?
>
> I have been investigating this quite extensively and everything I can find on the web suggests data corruption. However, running the following DO reports no errors and I can dump the database
> without issue.
>
pg_dump does not read the table through its indexes, so if only an index is corrupted you can get away with a dump/reload (though for big DBs this is unrealistic).
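If you want to narrow down the damaged rows first, something along these lines could work (just a sketch -- mytable and payload are placeholder names for the affected table and its TOASTed column, substitute your own):

DO $$
DECLARE
    r record;
    len integer;
BEGIN
    -- iterate over all rows; mytable / payload are placeholders
    FOR r IN SELECT ctid FROM mytable LOOP
        BEGIN
            -- length() forces full detoasting, so a broken toast entry raises an error here
            SELECT length(payload) INTO len FROM mytable WHERE ctid = r.ctid;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'damaged row at ctid % : %', r.ctid, SQLERRM;
        END;
    END LOOP;
END $$;

On a table of a few hundred million rows this will be slow, but it pinpoints exactly which rows are affected, so you can then delete/update just those and REINDEX the toast table as suggested in my earlier mail below.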
> I also found the following has been reported: https://www.postgresql.org/message-id/20161201165505.4360.28203@wrigleys.postgresql.org
>
> Best wishes,
> Harry
>
>> On 7 Jun 2017, at 15:22, Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com> wrote:
>>
>> On 07/06/2017 16:33, ADSJ (Adam Sjøgren) wrote:
>>> Our database has started reporting errors like this:
>>>
>>> 2017-05-31 13:48:10 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 14242189 in pg_toast_10919630
>>> ...
>>> 2017-06-01 11:06:56 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 19573520 in pg_toast_10919630
>>>
>>> (157 times, for different toast values, same pg_toast_nnn). pg_toast_10919630
>>> corresponds to a table with around 168 million rows.
>>>
>>> These went away, but the next day we got similar errors from another
>>> table:
>>>
>>> 2017-06-02 05:59:50 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 47060150 in pg_toast_10920100
>>> ...
>>> 2017-06-02 06:14:54 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 47226455 in pg_toast_10920100
>>>
>>> (Only 4 this time.) pg_toast_10920100 corresponds to a table which holds
>>> around 320 million rows (these are our two large tables).
>>>
>>> The next day we got 6 such errors and the day after 10 such errors. On
>>> June 5th we got 94, yesterday we got 111, of which one looked a little
>>> different:
>>>
>>> 2017-06-06 17:32:21 CEST ERROR: unexpected chunk size 1996 (expected 1585) in final chunk 0 for toast value 114925100 in pg_toast_10920100
>>>
>>> and today the logs have 65 lines, ending with these:
>>>
>>> 2017-06-07 14:49:53 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 131114834 in pg_toast_10920100
>>> 2017-06-07 14:53:41 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 131149566 in pg_toast_10920100
>> First try to find which tables those toast relations refer to:
>> select 10919630::regclass , 10920100::regclass ;
>> Are those critical tables? Can you restore them somehow?
>>
>> Also you may consider
>> REINDEX TABLE pg_toast.pg_toast_10920100;
>> REINDEX TABLE pg_toast.pg_toast_10919630;
>> REINDEX TABLE <name of table 10920100>;
>> REINDEX TABLE <name of table 10919630>;
>>
>> Also VACUUM the above tables.
>>
>> You might want to write a function which iterates over the damaged table's rows in order to identify the damaged row(s), and then perform an UPDATE on those rows to create a new, intact version.
>>
>>> The database is 10 TB on disk (SSDs) and runs on a 48 core server with 3
>>> TB RAM on Ubuntu 14.04 (Linux 3.18.13).
>>>
>>> We are updating rows in the database a lot/continuously.
>>>
>>> There are no apparent indications of hardware errors (like ECC) in
>>> dmesg, nor any error messages logged by the LSI MegaRAID controller, as
>>> far as I can tell.
>>>
>>> We are running PostgreSQL 9.3.14 currently.
>>>
>>> The only thing I could see in the release notes since 9.3.14 that might
>>> be related is this:
>>>
>>> "* Avoid very-low-probability data corruption due to testing tuple
>>> visibility without holding buffer lock (Thomas Munro, Peter Geoghegan,
>>> Tom Lane)"
>>>
>>> Although reading more about it, it doesn't sound like it would exhibit
>>> the symptoms we see?
>>>
>>> We have recently increased the load (to around twice the number of
>>> cores), though, which made me think we could be triggering corner cases
>>> we haven't hit before.
>>>
>>> We will be upgrading to PostgreSQL 9.3.17 during the weekend, but I'd like to hear
>>> if anyone has seen something like this, or has some ideas of how to
>>> investigate/what the cause might be.
>>>
>>>
>>> Best regards,
>>>
>>> Adam
>>>
>>
>> --
>> Achilleas Mantzios
>> IT DEV Lead
>> IT DEPT
>> Dynacom Tankers Mgmt
>>
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
