Re: FSM corruption leading to errors

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FSM corruption leading to errors
Date: 2016-10-10 23:50:18
Message-ID: CAB7nPqSd4L_P7NOGQ1CMeE8ecS8kyUyBP8Rv+9cMxFyJR=_EcQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
>
>
> On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
> wrote:
>>
>>
>>
>> + /*
>> + * See comments in GetPageWithFreeSpace about handling outside the
>> valid
>> + * range blocks
>> + */
>> + nblocks = RelationGetNumberOfBlocks(rel);
>> + while (target_block >= nblocks && target_block != InvalidBlockNumber)
>> + {
>> + target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
>> + spaceNeeded);
>> + }
>> Hm. This is just a workaround. Even if things are done this way the
>> FSM will remain corrupted.
>
>
> No, because the code above updates the FSM of those out-of-the range blocks.
> But now that I look at it again, may be this is not correct and it may get
> into an endless loop if the relation is repeatedly extended concurrently.

Ah yes, that's what the call for
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.

>> And isn't that going to break once the
>> relation is extended again?
>
>
> Once the underlying bug is fixed, I don't see why it should break again. I
> added the above code to mostly deal with already corrupt FSMs. May be we can
> just document and leave it to the user to run some correctness checks (see
> below), especially given that the code is not cheap and adds overheads for
> everybody, irrespective of whether they have or will ever have corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..

>> I'd suggest instead putting in the release
>> notes a query that allows one to analyze what are the relations broken
>> and directly have them fixed. That's annoying, but it would be really
>> better than a workaround. One idea here is to use pg_freespace() and
>> see if it returns a non-zero value for an out-of-range block on a
>> standby.
>>
>
> Right, that's how I tested for broken FSMs. A challenge with any such query
> is that if the shared buffer copy of the FSM page is intact, then the query
> won't return problematic FSMs. Of course, if the fix is applied to the
> standby and is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2016-10-11 00:13:53 parallel.sgml
Previous Message Dave Cramer 2016-10-10 23:47:19 Re: PL/Python adding support for multi-dimensional arrays