From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Concurrency bug with vacuum full (cluster) and toast
Date: 2019-03-18 16:53:22
Message-ID: CAPpHfdu3PJUzHpQrvJ5RC9bEX_bZ6LwX52kBpb0EiD_9e3Np5g@mail.gmail.com
Lists: pgsql-hackers

Hi all,

I've discovered a bug where vacuum full fails with an error because it can't
find toast chunks that it has itself deleted. This happens because
cluster_rel() sets OldestXmin, but the toast accesses get their snapshot later
and independently. As a result, heap_page_prune_opt() can clean up chunks
that rebuild_relation() expects to still exist. This bug happens only very
rarely, on busy systems that actively update toasted values, but I found a
way to reproduce it reliably using a debugger.
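
The horizon involved can be observed from SQL. Purely as an illustration (not
part of the reproduction below), while a session holds an old snapshot its
backend_xmin in pg_stat_activity stays behind, and, roughly speaking, that is
what defers the opportunistic pruning; once the session commits, the horizon
can advance and pruning becomes possible:

-- Run from any session: backends whose snapshots hold back the xmin horizon.
SELECT pid, state, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;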

*Setup*

CREATE FUNCTION random_string(seed integer, length integer) RETURNS text
AS $$
    SELECT substr(
               string_agg(
                   substr(
                       encode(
                           decode(
                               md5(seed::text || '-' || i::text),
                               'hex'),
                           'base64'),
                       1, 21),
                   ''),
               1, length)
    FROM generate_series(1, (length + 20) / 21) i;
$$ LANGUAGE SQL;

CREATE TABLE test (val text);
INSERT INTO test VALUES (random_string(1, 100000));
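
Optionally, as a sanity check (not in the original steps), you can confirm
the value went out of line and find the backing toast table; its name will
differ on every installation:

SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'test';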

*Reproduction steps*

s1-s3 are three parallel PostgreSQL sessions.
s3lldb is lldb attached to the backend running s3 (the session that will run vacuum full).
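
For example, one way to set this up (assuming lldb is available and you are
allowed to attach to the server processes):

s3# select pg_backend_pid();
$ lldb -p <pid returned above>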

First, s1 acquires a snapshot and holds it (in repeatable read the snapshot is only taken at the first statement, hence the select 1).

s1# begin transaction isolation level repeatable read;
s1# select 1;

Then s2 updates our toasted value several times.

s2# update test set val = random_string(2,100000);
s2# update test set val = random_string(3,100000);
s2# update test set val = random_string(4,100000);
s2# update test set val = random_string(5,100000);
s2# update test set val = random_string(6,100000);
s2# update test set val = random_string(7,100000);
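
At this point the dead chunk versions from the earlier updates are still
physically present, because s1's snapshot prevents them from being pruned.
One indirect, purely illustrative way to see this from any session is that
the toast relation keeps growing across the updates:

SELECT pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
FROM pg_class
WHERE relname = 'test';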

Then s3 starts vacuum full, which we stop at a breakpoint on vacuum_set_xid_limits().

s3lldb# b vacuum_set_xid_limits
s3# vacuum full test;

We let vacuum_set_xid_limits() return, making sure that the old tuple
versions created by s2 count as recently dead from vacuum full's point of view.

s3lldb# finish
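
Optionally, assuming a debug build where locals are visible, the freshly
computed cutoff can be inspected right here (OldestXmin is the variable
mentioned above; the exact frame depends on the server version):

s3lldb# p OldestXmin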

Then s1 releases its snapshot. With that snapshot gone, heap_page_prune_opt(),
called from the toast accesses, is free to clean up the toast chunks that
vacuum full expects to see as recently dead.

s1# commit;

Finally, we let our vacuum full continue and get the error!

s3lldb# continue
s3#
ERROR: unexpected chunk number 50 (expected 2) for toast value 16429
in pg_toast_16387

The attached patch contains a dirty fix for this bug, which simply prevents
heap_page_prune_opt() from cleaning tuples when it's called from
rebuild_relation(). It's not something I'm proposing to commit or even
review; it might just serve as a starting point for discussion.

Any ideas?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
cluster_toast_concurrency_fix.patch application/octet-stream 4.1 KB
