Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Date: 2011-02-18 21:43:19
Message-ID: 201102182143.p1ILhJo27167@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers


Did we make any progress on this? Is it a TODO?

---------------------------------------------------------------------------

Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Sun, Aug 15, 2010 at 9:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> ... and PANIC is absolutely, entirely, 100% unacceptable here. I don't
> >> think you understand the context. We've already written the truncate
> >> action to WAL (as we must: WAL before data change). If we PANIC, that
> >> means we'll PANIC again during WAL replay. So that solution means a
> >> down, and perhaps unrecoverably broken, database.
>
> > All right, that would be bad.
>
> Actually ... after some study of the code, I find that I'm holding this
> proposal to a higher standard than the current code maintains.
> According to our normal rules for applying WAL-loggable data changes,
> there should be a critical section wrapping the application of the
> data changes with making the WAL entry. RelationTruncate fails to
> do any such thing: it just pushes out the WAL entry and then calls
> smgrtruncate, and so any error inside the latter just results in
> elog(ERROR). Whereupon the system state is entirely different from what
> WAL says it should be. So my previous gut feeling that we need to
> rethink this whole thing seems justified.
>
> I traced through everything that leads to an ftruncate() call in the
> backend as of HEAD, and found that we have these cases to worry about:
>
> mdunlink calls ftruncate directly, and does nothing worse than
> elog(WARNING) on failure. This is fine because all it wants to do
> is save some disk space until the file gets deleted "for real" at
> the next checkpoint. Failure only means we waste disk space
> temporarily, and is surely not cause for panic.
>
> All other calls proceed (sometimes indirectly) from RelationTruncate
> or replay of the WAL record it emits. We have not thought hard
> about the order of the various truncations this does and whether or
> not we have a self-consistent state if it fails partway through.
> If we don't want to make the whole thing a critical section, we need
> to figure that out.
>
> RelationTruncate (and its subsidiary RelationTruncateIndexes) is called
> from heap_truncate_one_rel (which itself does things in an unsafe order),
> and from lazy_truncate_heap in VACUUM.
>
> heap_truncate_one_rel has (indirectly) two call sources:
>
> from ExecuteTruncate for a locally created rel, where we don't care,
> and would definitely rather NOT have a panic on error: just rolling back
> the transaction is fine thank you very much.
>
> from PreCommit_on_commit_actions, to process ON COMMIT DELETE ROWS.
> Here again, so far as heaps are concerned, rollback on error would be
> plenty since any inserted rows would then just be dead. The tricky
> part is the indexes for such a table. If we truncate them before
> truncating the heap, then the worst possible case is an internally
> inconsistent index on a temp table, which will be automatically cleaned
> up during the next successful commit in its session. So it's pretty
> hard to justify a PANIC response here either.
>
> So it seems like the only case where there is really grounds for PANIC
> on failure is the VACUUM case. And there we might apply Heikki's idea
> of trying to zero the untruncatable pages first.
>
> I'm thinking that we need some sort of what-to-do-on-error flag passed
> into RelationTruncate, plus at least order-of-operations fixes in
> several other places, if not a wholesale refactoring of this whole call
> stack. But I'm running out of steam and don't have a concrete proposal
> to make right now. In any case, we've got more problems here than just
> the original one of forgetting dirty buffers too soon.
>
> regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2011-02-18 21:46:01 Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Previous Message Tom Lane 2011-02-18 19:02:31 Re: BUG #5878: BTREE_BUILD_STATS causes 'make check' to fail

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-02-18 21:46:01 Re: DropRelFileNodeBuffers API change (was Re: [BUGS] BUG #5599: Vacuum fails due to index corruption issues)
Previous Message Tom Lane 2011-02-18 21:41:23 Re: WIP - Add ability to constrain backend temporary file space