Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328

From: Pius Chan <pchan(at)contigo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>, Frank Moi <fmoi(at)contigo(dot)com>, Ken Yu <kyu(at)contigo(dot)com>, Vincent Lasmarias <vlasmarias(at)contigo(dot)com>, Vladimir Kosilov <vkosilov(at)contigo(dot)com>
Subject: Re: BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328
Date: 2013-02-01 01:27:34
Message-ID: C62EC84B2D3CF847899CCF4B589CCF70B20AA08F@BBMBX.backbone.local
Lists: pgsql-bugs

Hi Tom,

The ERROR happened again. After several days of investigation and testing, I can now reproduce it in my testing environment. The reason the ERROR used to occur in a particular time window is that our report batch jobs run during that window, and those jobs make the TOAST area grow. I can repeat the ERROR with the following set-up and testing procedure.

Set up
=====
(1) check that the configuration parameter log_autovacuum_min_duration = 0
(2) in a 9.1.7 database cluster, create a database called "TEST_LOCK".
(3) In the database "TEST_LOCK", create one table called LOCK with one row in it:
create table lock (id bigint, constraint lock_pk primary key (id));
insert into lock values (1);

(4) in the same database cluster, create another database called "TEST".
(5) in the database "TEST", create one table test with the following columns:

create table test (
    id bigint not null,
    txop char(1),
    messageblob bytea,
    constraint test_pk primary key (id)
);

Pre-testing set up:
===========
(1) use psql to connect to TEST_LOCK, issue the following SQL, and leave the session open during the test:
begin;
select * from lock where id =1 for update;

(2) use another psql session to connect to TEST_LOCK and issue the same SQL as in (1). This session will be blocked by (1).
(3) this simulates our production environment, where our middle-tier service uses the TEST_LOCK.lock table for master/slave fail-over. We run the middle-tier service on several servers; when the service starts up, each instance tries to lock that row, and the server that successfully holds the lock becomes the master.

Testing set-up:
=========
Part 1 - traffic that will consume the TOAST area of the table "test" in TEST database
(1) use a program to INSERT INTO test with a messageblob longer than 1 MB and txop = 'A', which ensures the TOAST area of the test table is used. Then update the row's txop to 'D', then delete the same row. Each operation runs in its own transaction (one iteration is sketched below).
(2) this simulates the batch report traffic we see from 00:15 - 00:45 every night.
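
For illustration, one iteration of this Part 1 traffic looks roughly like the following (only a sketch, not the exact program; it assumes the pgcrypto module for gen_random_bytes() so the payload is incompressible and really lands in the TOAST table, and it is run with autocommit so each statement is its own transaction):

-- assumes CREATE EXTENSION pgcrypto; random bytes defeat TOAST compression,
-- so the >1 MB value is stored out of line in the TOAST table
insert into test (id, txop, messageblob)
values (1, 'A',
        (select string_agg(gen_random_bytes(1024), ''::bytea)   -- ~1.1 MB payload
         from generate_series(1, 1100)));
update test set txop = 'D' where id = 1;
delete from test where id = 1;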

Part 2 - traffic that will not consume the TOAST area of the table "test"
(1) use a program to INSERT INTO test with a messageblob of about 3000 bytes and txop = 'A'. Then update the row's txop to 'D', then delete the same row. Each operation runs in its own transaction (one iteration is sketched below).
(2) this simulates the background traffic we see during the whole day.
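
One iteration of the Part 2 traffic is the same pattern with a small payload (again only a sketch; roughly 3 kB of repetitive text compresses well and stays inline, so the TOAST table is not touched):

-- ~3 kB compressible payload, small enough to be kept inline (no TOAST storage)
insert into test (id, txop, messageblob)
values (2, 'A', convert_to(repeat(md5('background traffic'), 94), 'UTF8'));
update test set txop = 'D' where id = 2;
delete from test where id = 2;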

For part 1 and part 2, the traffic must be high enough that:
(1) the database log shows autovacuum of the table TEST.public.test running every minute
(2) the database log shows autovacuum of its TOAST table running every minute
(3) the number of pages scanned reported by ANALYZE VERBOSE test keeps increasing, even with autovacuum running on the table
(4) similarly, when querying the size of the TOAST table, its size keeps increasing, even with autovacuum running on it every minute
(5) clustering the table test does not reduce the table size or the TOAST size either
(6) at any time, the number of live rows is only a few while the number of dead rows is high (queries for checking the sizes and row counts are sketched after this list)
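
The sizes and row counts above can be checked with queries along these lines (a sketch; adjust names as needed):

-- heap and TOAST sizes of the test table (reltoastrelid points at pg_toast_xxxx)
select pg_relation_size('test')        as heap_bytes,
       pg_relation_size(reltoastrelid) as toast_bytes
from pg_class
where relname = 'test';

-- live vs. dead row counts as tracked by the statistics collector
select n_live_tup, n_dead_tup
from pg_stat_user_tables
where relname = 'test';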

Testing procedures:
============
(1) use psql to connect to TEST and analyze the table frequently to monitor its size. Run SQL to check the size of the table's TOAST area as well.
(2) tail the database log and check that the automatic vacuum of the table and its TOAST area happens every minute
(3) wait until the size of the table is at least 400 pages and the TOAST area is at least 20 MB
(4) wait for the database log to emit the line "automatic vacuum of table "TEST.pg_toast.pg_toast_xxxx": index scans: ..." (pg_toast_xxxx is the TOAST table of test; a query to map the table to its TOAST relation is sketched after this procedure), then IMMEDIATELY issue the command to cluster the table test in the psql session:

cluster test using test_pk;
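
To map the test table to its pg_toast_xxxx relation and to see when autovacuum last processed it without tailing the log, a catalog query roughly like this can be used (a sketch):

-- find the TOAST relation backing "test" and its last autovacuum time
select c.reltoastrelid::regclass as toast_rel,
       s.last_autovacuum
from pg_class c
join pg_stat_all_tables s on s.relid = c.reltoastrelid
where c.oid = 'test'::regclass;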

Testing results:
=========
(1) If the timing is right, I get the ERROR about one time in ten attempts:

ERROR: missing chunk number 0 for toast value 13113341 in pg_toast_13084520

(2) Once the ERROR occurs, subsequent CLUSTER commands fail with the same ERROR until the next autovacuum of the TOAST table completes.

(3) I tested PostgreSQL 9.1.7 and 9.1.3 and can reproduce the ERROR in both, so this is not a 9.1.7-specific bug.

Thanks a lot for your help.

Pius
________________________________________
From: Tom Lane [tgl(at)sss(dot)pgh(dot)pa(dot)us]
Sent: Tuesday, January 22, 2013 8:14 PM
To: Pius Chan
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #7819: missing chunk number 0 for toast value 1235919 in pg_toast_35328

Pius Chan <pchan(at)contigo(dot)com> writes:
> Hi Tom,
> Yes, we start seeing this ERROR after upgrade to 9.1.7. The ERROR is from the "cluster jms_messages" command. Last night, the ERROR happened three times:

> (1) at 00:20:01
> ERROR: missing chunk number 0 for toast value 1239124 in pg_toast_35328

> (2) at 00:25:01
> ERROR: missing chunk number 0 for toast value 1239124 in pg_toast_35328

> (3) at 00:35:01
> ERROR: missing chunk number 0 for toast value 1241022 in pg_toast_35328

> The "cluster jms_messages" runs every 5 minutes. However, so far, it seems that the ERROR happens at about mid-night and 35328 is the toast area of the "jms_message" table:

So what else is this application doing around midnight? In particular,
it seems like something must've happened between 00:15 and 00:20 to
create the problem with OID 1239124, and then something else happened
between 00:25 and 00:30 to get rid of it. And then a similar sequence
happened between 00:30 and 00:40 involving OID 1241022. Most likely the
trigger events are application actions against jms_messages, but we have
no info about what. (It's also possible that autovacuums of
jms_messages are involved, so you might want to crank up your logging
enough so you can see those in the postmaster log.)

I had first thought that this might have something to do with the
toast-value-processing changes we made in CLUSTER a year or so ago,
but a look in the commit logs confirms that those patches were in 9.1.3.
So I have no idea what might've broken between 9.1.3 and 9.1.7. We
need data, or even better an actual test case.

regards, tom lane
