Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2010-08-16 02:54:31
Message-ID: 11503.1281927271@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Aug 15, 2010 at 9:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> ... and PANIC is absolutely, entirely, 100% unacceptable here. I don't
>> think you understand the context. We've already written the truncate
>> action to WAL (as we must: WAL before data change). If we PANIC, that
>> means we'll PANIC again during WAL replay. So that solution means a
>> down, and perhaps unrecoverably broken, database.

> All right, that would be bad.

Actually ... after some study of the code, I find that I'm holding this
proposal to a higher standard than the current code maintains.
According to our normal rules for applying WAL-loggable data changes,
there should be a critical section wrapping the application of the
data changes with making the WAL entry. RelationTruncate fails to
do any such thing: it just pushes out the WAL entry and then calls
smgrtruncate, and so any error inside the latter just results in
elog(ERROR). Whereupon the system state is entirely different from what
WAL says it should be. So my previous gut feeling that we need to
rethink this whole thing seems justified.

I traced through everything that leads to an ftruncate() call in the
backend as of HEAD, and found that we have these cases to worry about:

mdunlink calls ftruncate directly, and does nothing worse than
elog(WARNING) on failure. This is fine because all it wants to do
is save some disk space until the file gets deleted "for real" at
the next checkpoint. Failure only means we waste disk space
temporarily, and is surely not cause for panic.

All other calls proceed (sometimes indirectly) from RelationTruncate
or replay of the WAL record it emits. We have not thought hard
about the order of the various truncations this does and whether or
not we have a self-consistent state if it fails partway through.
If we don't want to make the whole thing a critical section, we need
to figure that out.

RelationTruncate (and its subsidiary RelationTruncateIndexes) is called
from heap_truncate_one_rel (which itself does things in an unsafe order),
and from lazy_truncate_heap in VACUUM.

heap_truncate_one_rel has (indirectly) two call sources:

from ExecuteTruncate for a locally created rel, where we don't care,
and would definitely rather NOT have a panic on error: just rolling back
the transaction is fine thank you very much.

from PreCommit_on_commit_actions, to process ON COMMIT DELETE ROWS.
Here again, so far as heaps are concerned, rollback on error would be
plenty since any inserted rows would then just be dead. The tricky
part is the indexes for such a table. If we truncate them before
truncating the heap, then the worst possible case is an internally
inconsistent index on a temp table, which will be automatically cleaned
up during the next successful commit in its session. So it's pretty
hard to justify a PANIC response here either.

So it seems like the only case where there is really grounds for PANIC
on failure is the VACUUM case. And there we might apply Heikki's idea
of trying to zero the untruncatable pages first.

I'm thinking that we need some sort of what-to-do-on-error flag passed
into RelationTruncate, plus at least order-of-operations fixes in
several other places, if not a wholesale refactoring of this whole call
stack. But I'm running out of steam and don't have a concrete proposal
to make right now. In any case, we've got more problems here than just
the original one of forgetting dirty buffers too soon.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Mariusz Majer 2010-08-16 07:54:44 Re: BUG #5614: Varchar column (with DEFAULT NULL) stores 'UL' value instead of null
Previous Message Robert Haas 2010-08-16 01:58:13 Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)

Browse pgsql-hackers by date

  From Date Subject
Next Message KaiGai Kohei 2010-08-16 02:59:29 Re: security label support, part.2
Previous Message Robert Haas 2010-08-16 02:50:24 Re: refactoring comment.c