From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: VACUUM FULL versus system catalog cache invalidation
Date: 2011-08-12 18:09:35
Message-ID: 1351.1313172575@sss.pgh.pa.us

I've been testing the problem reported by Dave Gould by running "make
installcheck-parallel" together with a tight loop of "vacuum full
pg_class":

while true; do psql -c "vacuum full pg_class" regression; usleep 100000; done

Even after fixing the cache-reset-recovery-order problem I described
yesterday, there are still occasional bizarre failures, for example
in this bit of truncate.sql:

BEGIN;
SELECT * FROM trunc_f;
TRUNCATE trunc_f;
SELECT * FROM trunc_f;
ROLLBACK;

BEGIN;
SELECT * FROM trunc_f;
TRUNCATE ONLY trunc_f;
ERROR: attempted to update invisible tuple

I'm not sure that there's only one bug remaining here, but I've
identified this bug at least. In the first transaction above, the
TRUNCATE has to update trunc_f's pg_class tuple with a new relfilenode
value. Say the update invalidates the tuple at tid A and inserts the
updated version at tid B. Then vacuum full gets control and rewrites
pg_class. It correctly keeps both versions of the trunc_f tuple, but
now they will be at new tid locations, say C and D. The original
transaction meanwhile was blocked trying to re-open pg_class to reload
trunc_f's pg_class tuple into its catcaches. Now it successfully does
that, and the tuple it reads has tid D. It finishes out the
transaction, and rolls back as per script. In the rollback, we know
that we have to flush catcache entries for tuple versions inserted by
the failed transaction ... but what inval.c has stored is an inval event
against pg_class tid B. There is no such entry in the catcaches, so
nothing happens, and the entry with tid D stays valid. Ooops.
The relcache entry for trunc_f does get rebuilt correctly (without
consulting the bogus catcache entry), so the next SELECT goes through
all right. But when the second TRUNCATE wants to again update the
pg_class tuple, it finds it in the catcaches ... and what it finds has
tid D, so it attempts to heap_update that TID instead of the actually
live tuple at tid C.
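
To see why the abort-time flush misses, here is a much-simplified sketch
of TID-matched invalidation.  The struct and function names are invented
for illustration; this is not the real catcache.c code.

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct TidCachedEntry
{
    ItemPointerData self_tid;   /* TID the entry was read from (tid D) */
    bool        dead;           /* marked for removal? */
    /* ... cached tuple data ... */
} TidCachedEntry;

static void
flush_by_tid(TidCachedEntry *entries, int nentries, ItemPointer event_tid)
{
    int         i;

    for (i = 0; i < nentries; i++)
    {
        /*
         * The abort-time event still carries tid B (recorded before
         * VACUUM FULL moved the tuple), but the entry rebuilt afterwards
         * stores tid D, so nothing matches and the stale entry survives.
         */
        if (ItemPointerEquals(&entries[i].self_tid, event_tid))
            entries[i].dead = true;
    }
}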

In this particular case, the incorrectly targeted tuple is the
just-invalidated version of trunc_f, so heap_update sees it as
XMIN_INVALID and fails with "attempted to update invisible tuple".

The potential consequences are hugely worse than that, though: with a
bit more time between the two operations, it'd be possible for someone
else to reclaim the dead tuple and replace it with something else.
As long as the TID is live when we get to it, heap_update will blindly
replace it, whether or not it has anything to do with the tuple we think
we're replacing. I suspect that some of the other bizarre cases I'm
seeing are artifacts of that type of outcome, but they're hard to
reproduce so I've not been able to pin them down yet.

In principle this same type of failure could have occurred before 9.0,
since catcache inval has always been TID-based. It turns out we were
saved by the fact that the old VACUUM FULL implementation would chicken
out and not do any tuple moving if it found any INSERT_IN_PROGRESS or
DELETE_IN_PROGRESS tuples. The pre-9.0 versions of CLUSTER reject such
tuples too, so I don't think we need to do anything in those branches.

But in 9.0 and up, we have a problem. So far I've thought of two possible
avenues to fix it:

1. When a SHAREDINVALCATALOG_ID inval message comes in (telling us a VAC
FULL or CLUSTER just finished on a system catalog), enter that message
into the transaction's local inval list as well as executing it
immediately. On transaction abort, redo the resulting cache flushes a
second time. This would ensure we flushed any bad entries even though
they were logged with obsolete TIDs elsewhere in the inval list.
(We might need to do this on commit too, though right now I don't think
so. Also, a cache reset event would a fortiori have to queue a similar
entry to blow things away a second time at transaction end.)

2. Forget about targeting catcache invals by TID, and instead just use the
key hash value to determine which cache entries to drop.
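
To make these two ideas a bit more concrete, here are rough sketches.
All of the names below are invented placeholders, not existing inval.c
or catcache.c entry points.  For #1, the point is just to remember
whole-catalog flushes so they can be redone at transaction abort:

#include "postgres.h"
#include "nodes/pg_list.h"

/* hypothetical: flush all catcache entries belonging to one catalog */
extern void execute_catalog_flush(Oid catalogId);

static List *pending_catalog_flushes = NIL;

static void
handle_catalog_inval(Oid catalogId)
{
    /* flush everything for this catalog immediately, as we do now */
    execute_catalog_flush(catalogId);

    /* but also remember it so the flush can be repeated at abort */
    pending_catalog_flushes = lappend_oid(pending_catalog_flushes, catalogId);
}

static void
at_abort_redo_catalog_flushes(void)
{
    ListCell   *lc;

    /*
     * Re-run the flushes, so that entries rebuilt with post-VACUUM-FULL
     * TIDs get dropped even though the per-tuple inval events queued
     * earlier in the transaction carry the obsolete TIDs.
     */
    foreach(lc, pending_catalog_flushes)
        execute_catalog_flush(lfirst_oid(lc));

    list_free(pending_catalog_flushes);
    pending_catalog_flushes = NIL;
}

For #2, the matching predicate simply changes from TID equality to
key-hash equality:

#include "postgres.h"

typedef struct HashCachedEntry
{
    uint32      hashValue;      /* hash of the entry's lookup key values */
    bool        dead;           /* marked for removal? */
    /* ... cached tuple data ... */
} HashCachedEntry;

static void
flush_by_hash(HashCachedEntry *entries, int nentries, uint32 event_hash)
{
    int         i;

    for (i = 0; i < nentries; i++)
    {
        /*
         * The hash value is computed from the cache keys, not from where
         * the tuple physically lives, so it survives a VACUUM FULL; the
         * cost is that an unrelated entry with the same hash value gets
         * dropped too.
         */
        if (entries[i].hashValue == event_hash)
            entries[i].dead = true;
    }
}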

Approach #2 seems a lot less invasive and more trustworthy, but it has the
disadvantage that cache invals would become more likely to blow away
entries unnecessarily (because of chance hashvalue collisions), even
without any VACUUM FULL being done. If we could make approach #1 work
reliably, it would result in more overhead during VACUUM FULL but less at
other times --- or at least we could hope so. In an environment where
lots of sinval overflows and consequent resets happen, we might come out
behind due to doubling the number of catcache flushes forced by a reset
event.

Right at the moment I'm leaning to approach #2. I wonder if anyone
sees it differently, or has an idea for a third approach?

BTW, going forward it might be interesting to think about invalidating
the catcaches based on xmin/xmax rather than specific TIDs. That would
reduce the sinval traffic to one message per transaction that had
affected the catalogs, instead of one per TID. But that clearly is not
going to lead to something we'd dare back-patch.
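
Purely as a sketch of that idea (the message struct and helper below are
hypothetical, and the details would need much more thought):

#include "postgres.h"
#include "access/htup.h"
#include "access/transam.h"

/* one message per transaction that modified the catalogs */
typedef struct XidInvalMsg
{
    int8        id;             /* message type discriminator */
    TransactionId xid;          /* the catalog-modifying transaction */
} XidInvalMsg;

/*
 * Receivers would drop any catcache entry whose cached tuple was
 * inserted or deleted by that transaction.
 */
static bool
entry_was_touched_by(HeapTupleHeader cached, TransactionId xid)
{
    return TransactionIdEquals(HeapTupleHeaderGetXmin(cached), xid) ||
           TransactionIdEquals(HeapTupleHeaderGetXmax(cached), xid);
}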

regards, tom lane
