Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()
Date: 2015-12-01 21:59:08
Message-ID: CA+TgmoZsAOwgJyJ+6iAd4pw=8LHLN==A7_O0hJdVSH7wG0RGmg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 28, 2015 at 11:15 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:
>> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
>> > /*
>> > * Optional array of WAL flush LSNs associated with entries in the SLRU
>> > * pages. If not zero/NULL, we must flush WAL before writing pages (true
>> > * for pg_clog, false for multixact, pg_subtrans, pg_notify). group_lsn[]
>> > * has lsn_groups_per_page entries per buffer slot, each containing the
>> > * highest LSN known for a contiguous group of SLRU entries on that slot's
>> > * page.
>> > */
>> > XLogRecPtr *group_lsn;
>> > int lsn_groups_per_page;
>> >
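(For reference, the write-side half of that interlock lives in
SlruPhysicalWritePage(): when group_lsn is set, the highest group LSN on the
page is found and WAL is flushed up to it before the page goes to disk. A
rough paraphrase of the slru.c logic, from memory, so details may differ:

    if (shared->group_lsn != NULL)
    {
        XLogRecPtr  max_lsn = InvalidXLogRecPtr;
        int         lsnindex = slotno * shared->lsn_groups_per_page;
        int         lsnoff;

        /* take the highest LSN recorded for any entry group on this page */
        for (lsnoff = 0; lsnoff < shared->lsn_groups_per_page; lsnoff++)
        {
            XLogRecPtr  this_lsn = shared->group_lsn[lsnindex + lsnoff];

            if (max_lsn < this_lsn)
                max_lsn = this_lsn;
        }

        /* write-WAL-before-data: flush up to that LSN before writing the page */
        if (!XLogRecPtrIsInvalid(max_lsn))
            XLogFlush(max_lsn);
    }

Multixact initializes its SLRUs without LSN groups, so group_lsn is NULL and
this flush is skipped for it.)
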
>
>> Here's the multixact.c comment justifying it:
>>
>> * XLOG interactions: this module generates an XLOG record whenever a new
>> * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
>> * whenever a new MultiXactId is defined. This allows us to completely
>> * rebuild the data entered since the last checkpoint during XLOG replay.
>> * Because this is possible, we need not follow the normal rule of
>> * "write WAL before data"; the only correctness guarantee needed is that
>> * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
>> * checkpoint is considered complete. If a page does make it to disk ahead
>> * of corresponding WAL records, it will be forcibly zeroed before use anyway.
>> * Therefore, we don't need to mark our pages with LSN information; we have
>> * enough synchronization already.
>>
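(The zero-page records that comment relies on are emitted whenever a new
offsets or members page is created; a condensed paraphrase of multixact.c,
from memory, so names and details may differ:

    static int
    ZeroMultiXactOffsetPage(int pageno, bool writeXlog)
    {
        int     slotno;

        /* zero the page in the SLRU buffers ... */
        slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);

        /* ... and log it, so replay re-zeroes the page after a crash */
        if (writeXlog)
            WriteMZeroPageXlogRec(pageno, XLOG_MULTIXACT_ZERO_OFF_PAGE);

        return slotno;
    }

The members side does the same with XLOG_MULTIXACT_ZERO_MEM_PAGE.)
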
>> The comment's justification is incomplete, though. What of pages filled over
>> the course of multiple checkpoint cycles?
>
> Having studied this further, I think it is safe for a reason given elsewhere:
>
> * Note: we need not flush this XLOG entry to disk before proceeding. The
> * only way for the MXID to be referenced from any data page is for
> * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
> * an XLOG record that must follow ours. The normal LSN interlock between
> * the data page and that XLOG record will ensure that our XLOG record
> * reaches disk first. If the SLRU members/offsets data reaches disk
> * sooner than the XLOG record, we do not care because we'll overwrite it
> * with zeroes unless the XLOG record is there too; see notes at top of
> * this file.
>
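(The "normal LSN interlock" referred to there is the usual
write-WAL-before-data rule on heap pages; roughly, paraphrasing heapam.c and
bufmgr.c from memory:

    /* heap_lock_tuple(), after putting the new mxid into the tuple's xmax */
    recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);  /* follows our CREATE_ID record */
    PageSetLSN(page, recptr);

    /* FlushBuffer(), before that heap page may reach disk */
    recptr = BufferGetLSN(buf);
    XLogFlush(recptr);   /* so the lock record, and hence CREATE_ID, hits disk first */

so the heap page carrying the mxid cannot be written out ahead of the WAL that
defines it.)
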
> I find no flaw in the first three sentences. In most cases, one of
> multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
> overwrite the widowed mxid data with zeroes before the system again writes
> data in that vicinity. We can fail to do that if a backend stalls just after
> calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
> other backends filled the balance of those pages. That's still okay; if we
> never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the mxid
> will survive recovery. I drafted the attached comment update to consolidate
> and slightly correct this justification. (If we were designing the multixact
> subsystem afresh today, I'd vote for following the write-WAL-before-data rule
> despite believing it is not needed. The simplicity would be worth it.)
>
> While contemplating the "stalls ... just after calling GetNewMultiXactId()"
> scenario, I noticed a race condition not involving any write-WAL-before-data
> violation. MultiXactIdCreateFromMembers() and callees have this skeleton:
>
> GetNewMultiXactId
> LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
> ExtendMultiXactOffset()
> ExtendMultiXactMember()
> START_CRIT_SECTION()
> (MultiXactState->nextMXact)++
> MultiXactState->nextOffset += nmembers
> LWLockRelease(MultiXactGenLock)
> XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
> RecordNewMultiXact
> LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
> *offptr = offset; /* update pg_multixact/offsets SLRU page */
> LWLockRelease(MultiXactOffsetControlLock)
> LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
> *memberptr = members[i].xid; /* update pg_multixact/members SLRU page */
> *flagsptr = flagsval; /* update pg_multixact/members SLRU page */
> LWLockRelease(MultiXactMemberControlLock)
> END_CRIT_SECTION
>
> Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
> create higher-numbered mxids. They will skip over SLRU space the current
> backend reserved. Future GetMultiXactIdMembers() calls rely critically on the
> current backend eventually finishing:
>
> * 2. The next multixact may still be in process of being filled in: that
> * is, another process may have done GetNewMultiXactId but not yet written
> * the offset entry for that ID. In that scenario, it is guaranteed that
> * the offset entry for that multixact exists (because GetNewMultiXactId
> * won't release MultiXactGenLock until it does) but contains zero
> * (because we are careful to pre-zero offset pages). Because
> * GetNewMultiXactId will never return zero as the starting offset for a
> * multixact, when we read zero as the next multixact's offset, we know we
> * have this case. We sleep for a bit and try again.
>
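(That wait is the retry loop in GetMultiXactIdMembers(); approximately, again
paraphrased from memory:

    retry:
        LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
        ...
        nextOffset = *offptr;       /* offset recorded for the next multixact */
        if (nextOffset == 0)
        {
            /* creator has reserved the entry but not filled it in yet; wait */
            LWLockRelease(MultiXactOffsetControlLock);
            CHECK_FOR_INTERRUPTS();
            pg_usleep(1000L);
            goto retry;
        }

If the offset never gets filled in, this loops forever.)
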
> A crash while the creating backend is paused this way makes the zero offset
> permanent. After recovery, even though no xmax carries the offset-zero mxid,
> GetMultiXactIdMembers() on the immediately preceding mxid hangs indefinitely.
> Also, pg_get_multixact_members(zero_offset_mxid) hits an assertion failure.
> I have not thoroughly considered how best to fix these. Test procedure:
>
> -- backend 1
> checkpoint;
> create table victim0 (c) as select 1;
> create table stall1 (c) as select 1;
> create table last2 (c) as select 1;
> begin;
> select c from victim0 for key share;
> select c from stall1 for key share;
> select c from last2 for key share;
>
> -- backend 2
> begin; update victim0 set c = c + 1; rollback; -- burn one mxid
> -- In a shell, attach GDB to backend 2. GDB will stop the next SQL command at
> -- XLogInsert() in MultiXactIdCreateFromMembers():
> -- gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c
> select c from stall1 for key share; -- stall in gdb while making an mxid
>
> -- backend 3
> select c from last2 for key share; -- use one more mxid; flush WAL
>
> -- in GDB session, issue command: kill
>
> -- backend 1, after recovery
> select c from victim0; -- hangs, uncancelable
>
> -- backend 2, after recovery: assertion failure
> select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;

For tracking purposes, I updated
https://wiki.postgresql.org/wiki/MultiXact_Bugs and added this. We're
probably past due to spend some time cleaning up the older issues that
Andres's patch didn't fix (although it fixed a lot).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
