GetOldestXmin going backwards is dangerous after all

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: GetOldestXmin going backwards is dangerous after all
Date: 2013-02-01 19:35:27
Message-ID: 20362.1359747327@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been able to reproduce the problem reported by Pius Chan in bug
#7819. With some logging added that prints the OldestXmin values used
by vacuum and cluster operations, the reason is fairly clear:

2013-02-01 13:41:12 EST 8011 LOG: vacuuming "test" with OldestXmin 1760160 FreezeLimit 4246727456
2013-02-01 13:41:14 EST 8011 LOG: automatic vacuum of table "test.public.test": index scans: 1
pages: 0 removed, 635 remain
tuples: 48895 removed, 91 remain
system usage: CPU 0.00s/0.03u sec elapsed 1.64 sec
2013-02-01 13:41:14 EST 8011 LOG: automatic analyze of table "test.public.test" system usage: CPU 0.00s/0.00u sec elapsed 0.09 sec
2013-02-01 13:41:14 EST 8011 LOG: vacuuming "pg_toast_115435" with OldestXmin 1761518 FreezeLimit 4246728814
2013-02-01 13:41:57 EST 8011 LOG: automatic vacuum of table "test.pg_toast.pg_toast_115435": index scans: 1
pages: 0 removed, 12904 remain
tuples: 7624 removed, 43992 remain
system usage: CPU 0.21s/0.13u sec elapsed 42.79 sec
2013-02-01 13:41:58 EST 7025 LOG: clustering "test" with OldestXmin 1752719 FreezeXid 1630979
2013-02-01 13:41:58 EST 7025 ERROR: missing chunk number 0 for toast value 215828 in pg_toast_115435

IOW, autovacuum worker 8011 vacuumed "test" with one xmin, and then a
bit later vacuumed its toast table with a slightly newer xmin. This
means that certain tuples' toast tuples could get cleaned out of the
toast table while the parent tuples are still there in the heap.
Normally this is fine because the parent tuples are certainly dead and
beyond snapshot range, so no one will make any attempt to fetch their
toast values. But then, here comes CLUSTER with an older xmin ---
so it thinks it needs to copy over some of those dead-but-possibly-
still-visible-to-somebody tuples, and kaboom.

Now, CLUSTER is very careful to make sure that it gets a lock on the
toast table before computing OldestXmin, with the express intent of
preventing this type of problem (cf comment in copy_heap_data).
However, if the result of GetOldestXmin() can ever go backwards,
that doesn't help. And guess what ... it can.

The header comment for GetOldestXmin() contains a long rationalization
for why it's safe for the result to sometimes go backwards, but now
that I've seen this example I'm pretty sure that that's all horsepucky.

I think we need to either ensure that GetOldestXmin is strictly
nondecreasing, or find some more-direct way of interlocking vacuuming
of toast tables with their parents.

We could possibly do the former by tracking the last-known value in
shared memory, but it'd be complicated because the value can
(legitimately) vary depending on which database you ask about.

The only backpatchable way I can think of to do the latter is to undo
the decoupling of main and toast-table vacuuming, and then say that
we simply don't change the value of OldestXmin when going to vacuum
the toast table. This is a bit unpleasant.

In any case, I no longer have much faith in the idea that letting
GetOldestXmin go backwards is really safe.

The one bright spot is that AFAICS this isn't a true data loss hazard;
no possibly-still-needed data can be lost. Any failures would be
transient failures of CLUSTER or VACUUM FULL.

Thoughts?

regards, tom lane

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-02-01 19:46:52 Re: Re: [COMMITTERS] pgsql: Tolerate timeline switches while "pg_basebackup -X fetch" is run
Previous Message Marko Tiikkaja 2013-02-01 18:19:30 Re: PL/PgSQL STRICT