Re: FSM corruption leading to errors

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FSM corruption leading to errors
Date: 2016-10-10 14:41:16
Message-ID: CABOikdM5rw=25qQc+wZoYN5yym2r09Q9X0Ria4_P48CGeCRU_g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:

> + /*
> + * See comments in GetPageWithFreeSpace about handling outside the valid
> + * range blocks
> + */
> + nblocks = RelationGetNumberOfBlocks(rel);
> + while (target_block >= nblocks && target_block != InvalidBlockNumber)
> + {
> + target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
> + spaceNeeded);
> + }
> Hm. This is just a workaround. Even if things are done this way the
> FSM will remain corrupted.

No, because the code above updates the FSM entries of those out-of-range
blocks: it records zero free space for each one. But now that I look at it
again, maybe this is not correct, and it may get into an endless loop if
the relation is repeatedly extended concurrently.
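
To make the intent of that loop concrete, here is a toy model of it. This is only a sketch: the flat array, the ToyFsm struct and the toy_* functions are made up for illustration and are not the PostgreSQL FSM structures or API, and it assumes a positive space request and no concurrent extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for illustration; not the PostgreSQL definitions. */
typedef uint32_t ToyBlockNumber;
#define TOY_INVALID_BLOCK ((ToyBlockNumber) 0xFFFFFFFF)

typedef struct ToyFsm
{
    size_t  nentries;   /* number of blocks the map has entries for */
    int    *avail;      /* recorded free bytes per block */
} ToyFsm;

/*
 * Record old_avail for old_block, then return the first block whose
 * recorded free space is at least "needed", or TOY_INVALID_BLOCK.
 */
static ToyBlockNumber
toy_record_and_get(ToyFsm *fsm, ToyBlockNumber old_block,
                   int old_avail, int needed)
{
    if (old_block < fsm->nentries)
        fsm->avail[old_block] = old_avail;

    for (size_t i = 0; i < fsm->nentries; i++)
    {
        if (fsm->avail[i] >= needed)
            return (ToyBlockNumber) i;
    }
    return TOY_INVALID_BLOCK;
}

/*
 * Same shape as the quoted loop: while the map hands back a block at or
 * beyond nblocks, record zero free space for it and ask again.
 * nblocks is read once by the caller, so this model says nothing about
 * what happens when the relation grows while the loop is running.
 */
static ToyBlockNumber
toy_skip_out_of_range(ToyFsm *fsm, ToyBlockNumber target,
                      ToyBlockNumber nblocks, int needed)
{
    assert(needed > 0);     /* a zeroed entry must stop matching */

    while (target >= nblocks && target != TOY_INVALID_BLOCK)
        target = toy_record_and_get(fsm, target, 0, needed);

    return target;
}
```

In this model every iteration zeroes one stale entry, so the loop stops after at most as many rounds as there are stale entries, and the stale entries it visited stay repaired afterwards.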

> And isn't that going to break once the
> relation is extended again?

Once the underlying bug is fixed, I don't see why it should break again. I
added the above code mostly to deal with already-corrupt FSMs. Maybe we
can just document the problem and leave it to the user to run some
correctness checks (see below), especially given that the code is not cheap
and adds overhead for everybody, irrespective of whether they have or will
ever have a corrupt FSM.

> I'd suggest instead putting in the release
> notes a query that allows one to analyze what are the relations broken
> and directly have them fixed. That's annoying, but it would be really
> better than a workaround. One idea here is to use pg_freespace() and
> see if it returns a non-zero value for an out-of-range block on a
> standby.
>
>
Right, that's how I tested for broken FSMs. A challenge with any such query
is that if the shared-buffer copy of the FSM page is intact, then the query
won't report the problematic FSM. Of course, once the fix is applied to the
standby and the standby is restarted, corrupt FSMs can be detected.
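
The check itself amounts to looking for map entries beyond the end of the relation that still advertise free space. A minimal sketch of that idea over a toy in-memory map follows; the array and the toy_count_stale_entries name are hypothetical, and a real check would go through pg_freespace() rather than anything like this.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count map entries for blocks at or beyond nblocks that still record
 * non-zero free space. "avail" is a toy per-block array, not the real
 * FSM page layout.
 */
static size_t
toy_count_stale_entries(const int *avail, size_t nentries, size_t nblocks)
{
    size_t stale = 0;

    for (size_t blk = nblocks; blk < nentries; blk++)
    {
        if (avail[blk] > 0)
            stale++;
    }
    return stale;
}
```

A non-zero count for a relation corresponds to the "non-zero value for an out-of-range block" condition described in the quoted text.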

> At the same time, I have translated your script into a TAP test, I
> found that more useful when testing.

Thanks for doing that.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
