Re: BUG #5599: Vacuum fails due to index corruption issues

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5599: Vacuum fails due to index corruption issues
Date: 2010-08-05 15:06:57
Message-ID: 1880.1281020817@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hitesh Bhambhani <hitesh(dot)bhambhani(at)asg(dot)com> writes:
>> From: Alvaro Herrera [mailto:alvherre(at)commandprompt(dot)com]
>> Sent: Wednesday, August 04, 2010 11:30 PM
>> There probably is. What kind of relation are the ones unable to truncate?
>> Please see in pg_class where relfilenode = '41274' in this
>> case:
>>
> [HiteshB] the relation is called pmoinstance_idx_pmotypeid.

So the relation that couldn't be truncated is indeed the same one
complained of in the later message.

What it looks like to me is that page 30 was deleted, and then vacuum
tried to truncate it off the index altogether, and that failed because
of Windows randomness, and then later it's trying to delete page 30
again. Which naturally fails because the parent downlink entry is long
gone. But it's odd that it tries to delete page 30 twice. The page
should have been in BTP_DELETED state before the truncate attempt, and
therefore should still be dead later, so why's it trying again?

[ thinks for a bit ... ] I do see a way that could happen. The page
could still be live on disk when we reach smgrtruncate(): the update
to BTP_DELETED state might only exist in a dirty shared buffer. And
lookee here what smgrtruncate does:

/*
* Get rid of any buffers for the about-to-be-deleted blocks. bufmgr will
* just drop them without bothering to write the contents.
*/
DropRelFileNodeBuffers(reln->smgr_rnode, forknum, isTemp, nblocks);

So we throw away the BTP_DELETED state update without ever writing it to
disk, and then when the truncate fails, the old page state is still out
there, ready to confuse us later.

Seems like we need to think harder about recovering from a truncate
failure. A few random ideas:

1. Write the dirty buffers before dropping them. Kind of ugly from a
performance viewpoint, but simple and safe.

2. Treat truncation failure as a PANIC condition; then WAL replay will
fix things. Unpleasant. Actually, worse than unpleasant: I think what
the comment in RelationTruncate() is pointing out is that if the
failure is repeatable, we'd fail to recover at all.

3. Don't discard the dirty buffers until after successfully truncating.
The hard part here is to be sure no other process (like bgwriter) will
try to write them in between. I don't see a way to guarantee that,
at least not without interlock infrastructure that doesn't exist today.

And see also that comment in RelationTruncate(). Seems like the whole
problem of coping with truncation failure needs more thought than we've
given it.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2010-08-05 15:25:37 Re: string_agg delimiter having no effect with order by
Previous Message Pavel Stehule 2010-08-05 14:57:22 Re: string_agg delimiter having no effect with order by

Browse pgsql-hackers by date

  From Date Subject
Next Message Joshua Tolley 2010-08-05 15:12:27 Re: GROUPING SETS revisited
Previous Message Pavel Stehule 2010-08-05 14:57:22 Re: string_agg delimiter having no effect with order by