Re: VM map freeze corruption

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, "Wood, Dan" <hexpert(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: VM map freeze corruption
Date: 2018-04-19 02:35:02
Message-ID: CAD21AoCqXqQdPTTRt_bUikpZFS89+6KO8AyuvhRdaq55x7LaKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 18, 2018 at 10:36 PM, Alvaro Herrera
<alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> Pavan Deolasee wrote:
>> On Wed, Apr 18, 2018 at 7:37 AM, Wood, Dan <hexpert(at)amazon(dot)com> wrote:
>
>> > My analysis is that heap_prepare_freeze_tuple->FreezeMultiXactId()
>> > returns FRM_NOOP if the MultiXACT locked rows haven't committed. This
>> > results in changed=false and totally_frozen=true(as initialized). When
>> > this returns to lazy_scan_heap(), no rows are added to the frozen[] array.
>> > Yet, tuple_totally_frozen is true. This means the page is marked frozen in
>> > the VM, even though the MultiXACT row wasn't left untouch.
>> >
>> > A fix to heap_prepare_freeze_tuple() that seems to do the trick is:
>> > else
>> > {
>> > Assert(flags & FRM_NOOP);
>> > + totally_frozen = false;
>> > }
>> >
>>
>> That's a great find!
>
> Indeed.
>
> This family of bugs (introduced by freeze map changes in 9.6) was
> initially fixed in 38e9f90a227d but this spot was missed in that fix.
>
> IMO the cause is the totally_frozen variable, which starts life in a
> bogus state (true) and the different code paths can set it to the right
> state, or by inaction end up deciding that the initial bogus state was
> correct in the first place. Errors of omission are far too easy in that
> kind of model, ISTM, so I propose this slightly different patch, which
> albeit yet untested seems easier to verify and easier to get right.
>

Thank you for the patch!
Agreed. The patch looks to fix this issue correctly.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2018-04-19 02:58:15 Re: Adding an LWLockHeldByMe()-like function that reports if any buffer content lock is held
Previous Message Michael Paquier 2018-04-19 02:28:02 Re: Speedup of relation deletes during recovery