Re: FSM corruption leading to errors

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: FSM corruption leading to errors
Date: 2016-10-18 03:12:54
Message-ID: CAB7nPqTnVw7TfD9n+-fLX7u+1qWEU3pYTOv0TPq1NoRp1SbhAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 17, 2016 at 8:04 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> visibilitymap_truncate is actually also wrong, in a different way. The
> truncation WAL record is written only after the VM (and FSM) are truncated.
> But visibilitymap_truncate() has already modified and dirtied the page. If
> the VM page change is flushed to disk before the WAL record, and you crash,
> you might have a torn VM page and a checksum failure.

Good point! I didn't think about that.

> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
> FreeSpaceMapTruncateRel would have the same issue. If you call
> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
> to make sure the WAL record is flushed first.

Right.. All the code paths calling FreeSpaceMapTruncateRel or
visibilitymap_truncate do flush the record, but it definitely make
things more robust to set the LSN on the page. So +1 this way.

> I think we need something like the attached.

OK, I had a look at that.

MarkBufferDirty(mapBuffer);
+ if (lsn != InvalidXLogRecPtr)
+ PageSetLSN(page, lsn);
UnlockReleaseBuffer(mapBuffer);
Nit: using XLogRecPtrIsInvalid() makes more sense for such checks?

pg_visibility calls as well visibilitymap_truncate, but this patch did
not update it. And it would be better to do the actual truncate after
writing the WAL record I think.

You are also breaking XLogFlush() in RelationTruncate() because vm and
fsm are assigned thanks to a call of smgrexists(), something done
before XLogFlush is called on HEAD, and after with your patch. So your
patch finishes by never calling XLogFlush() on relation truncation
even if there is a VM or a FSM.

Attached is an updated version with those issues fixed. The final
version does not need the test as that's really a corner-case, but I
still included it to help testing for now.
--
Michael

Attachment Content-Type Size
truncation-vm-fsm.patch application/x-download 13.1 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-10-18 03:34:52 Re: Re: [COMMITTERS] pgsql: Replace PostmasterRandom() with a stronger way of generating ran
Previous Message Amit Kapila 2016-10-18 03:08:02 Re: Parallel Index Scans