From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Concurrency bug with vacuum full (cluster) and toast
Date: 2019-03-18 16:53:22
Message-ID: CAPpHfdu3PJUzHpQrvJ5RC9bEX_bZ6LwX52kBpb0EiD_9e3Np5g@mail.gmail.com
Lists: pgsql-hackers

Hi all,

I've discovered a bug where vacuum full fails with an error because it can't
find toast chunks that it has itself deleted. This happens because
cluster_rel() sets OldestXmin, but the toast accesses get their snapshot later
and independently. As a result, heap_page_prune_opt() can clean up chunks
that rebuild_relation() expects to still exist. This bug happens only very
rarely, on busy systems that actively update toasted values, but I found a
way to reproduce it reliably using a debugger.
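
The horizon involved can be observed from SQL. Purely as an illustration (not
part of the reproduction below), while a session holds an old snapshot its
backend_xmin in pg_stat_activity stays behind, and, roughly speaking, that is
what defers the opportunistic pruning; once the session commits, the horizon
can advance and pruning becomes possible:

-- Run from any session: backends whose snapshots hold back the xmin horizon.
SELECT pid, state, backend_xmin
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;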

*Setup*

CREATE FUNCTION random_string(seed integer, length integer) RETURNS text
AS $$
    SELECT substr(
               string_agg(
                   substr(
                       encode(
                           decode(
                               md5(seed::text || '-' || i::text),
                               'hex'),
                           'base64'),
                       1, 21),
                   ''),
               1, length)
    FROM generate_series(1, (length + 20) / 21) i;
$$ LANGUAGE SQL;

CREATE TABLE test (val text);
INSERT INTO test VALUES (random_string(1, 100000));
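
Optionally, as a sanity check (not in the original steps), you can confirm
the value went out of line and find the backing toast table; its name will
differ on every installation:

SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'test';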

*Reproduction steps*

s1-s3 are three parallel PostgreSQL sessions.
s3lldb is lldb attached to the backend running s3 (the session that will run vacuum full).
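
For example, one way to set this up (assuming lldb is available and you are
allowed to attach to the server processes):

s3# select pg_backend_pid();
$ lldb -p <pid returned above>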

First, s1 acquires a snapshot and holds it (in repeatable read the snapshot is only taken at the first statement, hence the select 1).

s1# begin transaction isolation level repeatable read;
s1# select 1;

Then s2 updates our toasted value several times.

s2# update test set val = random_string(2,100000);
s2# update test set val = random_string(3,100000);
s2# update test set val = random_string(4,100000);
s2# update test set val = random_string(5,100000);
s2# update test set val = random_string(6,100000);
s2# update test set val = random_string(7,100000);
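
At this point the dead chunk versions from the earlier updates are still
physically present, because s1's snapshot prevents them from being pruned.
One indirect, purely illustrative way to see this from any session is that
the toast relation keeps growing across the updates:

SELECT pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
FROM pg_class
WHERE relname = 'test';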

Then s3 starts vacuum full, which we stop at a breakpoint on vacuum_set_xid_limits().

s3lldb# b vacuum_set_xid_limits
s3# vacuum full test;

We let vacuum_set_xid_limits() return, making sure that the old tuple
versions created by s2 count as recently dead from vacuum full's point of view.

s3lldb# finish
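
Optionally, assuming a debug build where locals are visible, the freshly
computed cutoff can be inspected right here (OldestXmin is the variable
mentioned above; the exact frame depends on the server version):

s3lldb# p OldestXmin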

Then s1 releases its snapshot. With that snapshot gone, heap_page_prune_opt(),
called from the toast accesses, is free to clean up the toast chunks that
vacuum full expects to see as recently dead.

s1# commit;

Finally, we let our vacuum full continue and get the error!

s3lldb# continue
s3#
ERROR: unexpected chunk number 50 (expected 2) for toast value 16429
in pg_toast_16387

The attached patch contains a dirty fix for this bug, which simply prevents
heap_page_prune_opt() from cleaning tuples when it's called from
rebuild_relation(). It's not something I'm proposing to commit or even
review; it might just serve as a starting point for discussion.

Any ideas?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment Content-Type Size
cluster_toast_concurrency_fix.patch application/octet-stream 4.1 KB
