Re: FSM corruption leading to errors

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FSM corruption leading to errors
Date: 2016-10-07 14:50:21
Message-ID: 0673bdc7-b38f-44ed-0e7b-c36801b1f13a@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

06.10.2016 20:59, Pavan Deolasee:
> I investigated a bug report from one of our customers and it looked
> very similar to previous bug reports here [1], [2], [3] (and probably
> more). In these reports, the error looks something like this:
>
> ERROR: could not read block 28991 in file "base/16390/572026": read
> only 0 of 8192 bytes
>
> I traced it to the following code in MarkBufferDirtyHint(). The
> function returns without setting the DIRTY bit on the standby:
>
> 3413 /*
> 3414 * If we're in recovery we cannot dirty a page
> because of a hint.
> 3415 * We can set the hint, just not dirty the page as a
> result so the
> 3416 * hint is lost when we evict the page or shutdown.
> 3417 *
> 3418 * See src/backend/storage/page/README for longer
> discussion.
> 3419 */
> 3420 if (RecoveryInProgress())
> 3421 return;
> 3422
>
> freespace.c freely uses MarkBufferDirtyHint() whenever changes are
> made to the FSM. I think that's usually alright because FSM changes
> are not WAL logged and if FSM ever returns a block with less free
> space than the caller needs, the caller is usually prepared to update
> the FSM and request for a new block. But if it returns a block that is
> outside the size of the relation, then we've a trouble. The very next
> ReadBuffer() fails to handle such a block and throws the error.
>
> When a relation is truncated, the FSM is truncated too to remove
> references to the heap blocks that are being truncated. But since the
> FSM buffer may not be marked DIRTY on the standby, if the buffer gets
> evicted from the buffer cache, the on-disk copy of the FSM page may be
> left with references to the truncated heap pages. When the standby is
> later promoted to be the master, and an insert/update is attempted to
> the table, the FSM may return a block that is outside the valid range
> of the relation. That results in the said error.
>
> Once this was clear, it was easy to put together a fully reproducible
> test case. See the attached script; you'll need to adjust to your
> environment. This affects all releases starting 9.3 and the script can
> reproduce the problem on all these releases.
>
> I believe the fix is very simple. The FSM change during truncation is
> critical and the buffer must be marked by MarkBufferDirty() i.e. those
> changes must make to the disk. I think it's alright not to WAL log
> them because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs.
> But it must not be lost across a checkpoint. Also, since it happens
> only during relation truncation, I don't see any problem from
> performance perspective.
>
> What bothers me is how to fix the problem for already affected
> standbys. If the FSM for some table is already corrupted at the
> standby, users won't notice it until the standby is promoted to be the
> new master. If the standby starts throwing errors suddenly after
> failover, it will be a very bad situation for the users, like we
> noticed with our customers. The fix is simple and users can just
> delete the FSM (and VACUUM the table), but that doesn't sound nice and
> they would not know until they see the problem.
>
> One idea is to always check if the block returned by the FSM is
> outside the range and discard such blocks after setting the FSM
> (attached patch does that). The problem with that approach is that
> RelationGetNumberOfBlocks() is not really cheap and invoking it
> everytime FSM is consulted may not be a bright idea. Can we cache that
> value in the RelationData or some such place (BulkInsertState?) and
> use that as a hint for quickly checking if the block is (potentially)
> outside the range and discard it? Any other ideas?
>
> The other concern I've and TBH that's what I initially thought as the
> real problem, until I saw RecoveryInProgress() specific code, is: can
> this also affect stand-alone masters? The comments at
> MarkBufferDirtyHint() made me think so:
>
> 3358 * 3. This function does not guarantee that the buffer is always
> marked dirty
> 3359 * (due to a race condition), so it cannot be used for
> important changes.
>
> So I was working with a theory that somehow updates to the FSM page
> are lost because the race mentioned in the comment actually kicks in.
> But I'm not sure if the race is only possible when the caller is
> holding a SHARE lock on the buffer. When the FSM is truncated, the
> caller holds an EXCLUSIVE lock on the FSM buffer. So probably we're
> safe. I could not reproduce the issue on a stand-alone master. But
> probably worth checking.
>
> It might also be a good idea to inspect other callers of
> MarkBufferDirtyHint() and see if any of them is vulnerable, especially
> from standby perspective. I did one round, and couldn't see another
> problem.
>
> Thanks,
> Pavan
>
> [1]
> https://www.postgresql.org/message-id/CAJakt-8%3DaXa-F7uFeLAeSYhQ4wFuaX3%2BytDuDj9c8Gx6S_ou%3Dw%40mail.gmail.com
> [2]
> https://www.postgresql.org/message-id/20160601134819.30392.85324@wrigleys.postgresql.org
> [3]
> https://www.postgresql.org/message-id/AMSPR06MB504CD8FE8AA30D4B7C958AAE39E0%40AMSPR06MB504.eurprd06.prod.outlook.com
>
>
> --
> Pavan Deolasee http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>

Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-10-07 14:51:13 Re: Is it time to kill support for very old servers?
Previous Message Magnus Hagander 2016-10-07 14:33:55 Re: Is it time to kill support for very old servers?