Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100

From: Harry Ambrose <harry(dot)ambrose(at)gmail(dot)com>
To: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: ERROR: unexpected chunk number 0 (expected 1) for toast value 76753264 in pg_toast_10920100
Date: 2017-06-07 14:49:08
Message-ID: EBB819BB-5660-4D9A-BD87-10ED5D7B1DE0@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I too have been experiencing this with a busy PostgreSQL instance.

I have been following the updates to the 9.4 branch hoping a fix will appear, but sadly no luck yet. I have manually replicated the issue on 9.4.4, 9.4.10 and 9.4.12. My replication steps are:

BEGIN;
CREATE TABLE x (id BIGSERIAL PRIMARY KEY, payload1 VARCHAR, payload2 VARCHAR, payload3 VARCHAR, payload4 BIGINT, payload5 BIGINT);
/* Repeat until 2,000,000 rows are inserted */
INSERT INTO x (id, payload1, payload2, payload3, payload4, payload5) VALUES (random values of varying length/size to force random toast usage);
COMMIT;

VACUUM (ANALYZE, FULL);

BEGIN;
/* Repeat until all 2,000,000 rows are updated */
UPDATE x SET payload1 = , payload2 = , payload3 = , payload4 = , payload5 = ... again random values of varying length/size to force random toast usage
COMMIT;

VACCUM (ANALYZE, FULL);

The second vacuum causes an ERROR identical to that you are reporting below (unexpected chunk number n (expected n) for toast value...). However it may take up to ten attempts to replicate it.

Out of interest, are you using any tablespaces other than pg_default? I can only replicate the issue when using separately mounted tablespaces.

I have been investigating this quite extensively and everything I can find on the web suggests data corruption. However running the the following DO reports no errors and I can dump the database without issue.

DO $$
DECLARE

curid INT := 0;
vcontent RECORD;
badid BIGINT;

var1_sub VARCHAR;
var2_sub VARCHAR;
var3_sub VARCHAR;
var4_sub VARCHAR;
var5_sub VARCHAR;

BEGIN
FOR badid IN SELECT id FROM x
LOOP
curid = curid + 1;

IF curid % 100000 = 0
THEN
RAISE NOTICE '% rows inspected', curid;
END IF;

BEGIN
SELECT *
INTO vcontent
FROM x
WHERE rowid = badid;

var1_sub := SUBSTR(vcontent.var1,2000,5000);
var2_sub := SUBSTR(vcontent.var2,2000,5000);
var3_sub := SUBSTR(vcontent.var3,2000,5000);
var4_sub := SUBSTR(vcontent.var4::VARCHAR,2000,5000);
var5_sub := SUBSTR(vcontent.var5::VARCHAR,2000,5000);

EXCEPTION WHEN OTHERS THEN
RAISE NOTICE 'Data for rowid % is corrupt', badid;
CONTINUE;
END;

END LOOP;
END;
$$;

I also found the following has been reported: https://www.postgresql.org/message-id/20161201165505.4360.28203@wrigleys.postgresql.org

Best wishes,
Harry

> On 7 Jun 2017, at 15:22, Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com> wrote:
>
> On 07/06/2017 16:33, ADSJ (Adam Sjøgren) wrote:
>> Our database has started reporting errors like this:
>>
>> 2017-05-31 13:48:10 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 14242189 in pg_toast_10919630
>> ...
>> 2017-06-01 11:06:56 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 19573520 in pg_toast_10919630
>>
>> (157 times, for different toast values, same pg_toast_nnn). pg_toast_10919630
>> corresponds to a table with around 168 million rows.
>>
>> These went away, but the next day we got similar errors from another
>> table:
>>
>> 2017-06-02 05:59:50 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 47060150 in pg_toast_10920100
>> ...
>> 2017-06-02 06:14:54 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 47226455 in pg_toast_10920100
>>
>> (Only 4 this time) pg_toast_10920100 corresponds to a table with holds
>> around 320 million rows (these are our two large tables).
>>
>> The next day we got 6 such errors and the day after 10 such errors. On
>> June 5th we got 94, yesterday we got 111, of which one looked a little
>> different:
>>
>> 2017-06-06 17:32:21 CEST ERROR: unexpected chunk size 1996 (expected 1585) in final chunk 0 for toast value 114925100 in pg_toast_10920100
>>
>> and today the logs have 65 lines, ending with these:
>>
>> 2017-06-07 14:49:53 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 131114834 in pg_toast_10920100
>> 2017-06-07 14:53:41 CEST ERROR: unexpected chunk number 0 (expected 1) for toast value 131149566 in pg_toast_10920100
> First try to find which tables those toast relations refer to :
> select 10919630::regclass , 10920100::regclass ;
> Are those critical tables? Can you restore them somehow?
>
> Also you may consider
> REINDEX TABLE pg_toast.pg_toast_10920100;
> REINDEX TABLE pg_toast.pg_toast_10919630;
> REINDEX TABLE <name of table 10920100>;
> REINDEX TABLE <name of table 10919630>;
>
> also VACUUM the above tables.
>
> You might want to write a function which iterates over the damaged table's rows in order to identify the damaged row(s). And then do some good update to create a new version.
>
>> The database is 10 TB on disk (SSDs) and runs on a 48 core server with 3
>> TB RAM on Ubuntu 14.04 (Linux 3.18.13).
>>
>> We are updating rows in the database a lot/continuously.
>>
>> There are no apparent indications of hardware errors (like ECC) in
>> dmesg, nor any error messages logged by the LSI MegaRAID controller, as
>> far as I can tell.
>>
>> We are running PostgreSQL 9.3.14 currently.
>>
>> The only thing I could see in the release notes since 9.3.14 that might
>> be related is this:
>>
>> "* Avoid very-low-probability data corruption due to testing tuple
>> visibility without holding buffer lock (Thomas Munro, Peter Geoghegan,
>> Tom Lane)"
>>
>> Although reading more about it, it doesn't sound like it would exhibit
>> the symptoms we see?
>>
>> We have recently increased the load (to around twice the number of
>> cores), though, which made me think we could be triggering corner cases
>> we haven't hit before.
>>
>> We will be upgrading to PostgreSQL 9.3.17 during the weekend, but I'd like to hear
>> if anyone has seen something like this, or have some ideas of how to
>> investigate/what the cause might be.
>>
>>
>> Best regards,
>>
>> Adam
>>
>
> --
> Achilleas Mantzios
> IT DEV Lead
> IT DEPT
> Dynacom Tankers Mgmt
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org <mailto:pgsql-general(at)postgresql(dot)org>)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general <http://www.postgresql.org/mailpref/pgsql-general>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message tel medola 2017-06-07 14:53:31 Re: Redo the filenode link in tablespace
Previous Message Merlin Moncure 2017-06-07 14:32:36 Re: Advisory lock deadlock issue