Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

From: Noah Misch <noah(at)leadboat(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()
Date: 2015-11-29 04:15:49
Message-ID: 20151129041549.GA1852531@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:
> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
> > /*
> > * Optional array of WAL flush LSNs associated with entries in the SLRU
> > * pages. If not zero/NULL, we must flush WAL before writing pages (true
> > * for pg_clog, false for multixact, pg_subtrans, pg_notify). group_lsn[]
> > * has lsn_groups_per_page entries per buffer slot, each containing the
> > * highest LSN known for a contiguous group of SLRU entries on that slot's
> > * page.
> > */
> > XLogRecPtr *group_lsn;
> > int lsn_groups_per_page;
> >
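(Aside: to make concrete what a non-NULL group_lsn buys, here is a minimal
standalone sketch, my own simplification rather than the actual slru.c code;
the model_* names are invented stand-ins. With group_lsn set, the highest LSN
recorded for the page is flushed before the page is written; with group_lsn
NULL, as for multixact, no flush happens at all.)

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;            /* stand-in for PostgreSQL's XLogRecPtr */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* invented stand-in for XLogFlush() */
static void model_xlog_flush(XLogRecPtr upto)
{
    printf("flush WAL through %llu before writing the page\n",
           (unsigned long long) upto);
}

/* model of the per-page decision in SlruPhysicalWritePage() */
static void model_write_page(const XLogRecPtr *group_lsn,
                             int lsn_groups_per_page, int slotno)
{
    if (group_lsn != NULL)
    {
        XLogRecPtr  max_lsn = InvalidXLogRecPtr;
        int         base = slotno * lsn_groups_per_page;

        for (int i = 0; i < lsn_groups_per_page; i++)
            if (group_lsn[base + i] > max_lsn)
                max_lsn = group_lsn[base + i];

        if (max_lsn != InvalidXLogRecPtr)
            model_xlog_flush(max_lsn);      /* pg_clog takes this path */
    }
    /* multixact, subtrans, notify: group_lsn is NULL, so no flush */
    printf("write SLRU page in slot %d\n", slotno);
}

int main(void)
{
    XLogRecPtr  clog_lsns[4] = {10, 42, 0, 7};  /* one buffer slot, 4 LSN groups */

    model_write_page(clog_lsns, 4, 0);  /* flushes WAL through 42, then writes */
    model_write_page(NULL, 4, 0);       /* multixact-style: writes with no flush */
    return 0;
}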

> Here's the multixact.c comment justifying it:
>
> * XLOG interactions: this module generates an XLOG record whenever a new
> * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
> * whenever a new MultiXactId is defined. This allows us to completely
> * rebuild the data entered since the last checkpoint during XLOG replay.
> * Because this is possible, we need not follow the normal rule of
> * "write WAL before data"; the only correctness guarantee needed is that
> * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
> * checkpoint is considered complete. If a page does make it to disk ahead
> * of corresponding WAL records, it will be forcibly zeroed before use anyway.
> * Therefore, we don't need to mark our pages with LSN information; we have
> * enough synchronization already.
>
> The comment's justification is incomplete, though. What of pages filled over
> the course of multiple checkpoint cycles?

Having studied this further, I think it is safe for a reason given elsewhere:

* Note: we need not flush this XLOG entry to disk before proceeding. The
* only way for the MXID to be referenced from any data page is for
* heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
* an XLOG record that must follow ours. The normal LSN interlock between
* the data page and that XLOG record will ensure that our XLOG record
* reaches disk first. If the SLRU members/offsets data reaches disk
* sooner than the XLOG record, we do not care because we'll overwrite it
* with zeroes unless the XLOG record is there too; see notes at top of
* this file.

I find no flaw in the first three sentences. In most cases, one of
multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
overwrite the widowed mxid data with zeroes before the system again writes
data in that vicinity. We can fail to do that if a backend stalls just after
calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
other backends filled the balance of those pages. That's still okay; if we
never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the mxid
will survive recovery. I drafted the attached comment update to consolidate
and slightly correct this justification. (If we were designing the multixact
subsystem afresh today, I'd vote for following the write-WAL-before-data rule
despite believing it is not needed. The simplicity would be worth it.)
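
To spell out the "normal LSN interlock" the quoted comment relies on: the data
page carries the LSN of the last WAL record that touched it, and the buffer
manager flushes WAL through that LSN before the page itself can go to disk.
Because WAL is ordered, flushing the heap_lock_tuple() record necessarily
flushes the earlier XLOG_MULTIXACT_CREATE_ID record too. A minimal standalone
model (my simplification; the model_* names are invented, not PostgreSQL
functions):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr wal_flushed_upto = 0;     /* how much WAL is durable */
static XLogRecPtr wal_insert_pos = 0;       /* end of WAL in memory */

/* append a WAL record; return its position */
static XLogRecPtr model_xlog_insert(const char *desc)
{
    wal_insert_pos += 100;                  /* pretend each record is 100 bytes */
    printf("WAL %llu: %s\n", (unsigned long long) wal_insert_pos, desc);
    return wal_insert_pos;
}

static void model_xlog_flush(XLogRecPtr upto)
{
    if (upto > wal_flushed_upto)
        wal_flushed_upto = upto;
}

/* FlushBuffer()-style rule: WAL through the page LSN first, then the page */
static void model_flush_data_page(XLogRecPtr page_lsn)
{
    model_xlog_flush(page_lsn);
    printf("data page written; WAL durable through %llu\n",
           (unsigned long long) wal_flushed_upto);
}

int main(void)
{
    /* mxid creation is WAL-logged but not flushed ... */
    XLogRecPtr create_mxid = model_xlog_insert("XLOG_MULTIXACT_CREATE_ID");
    /* ... heap_lock_tuple() then stamps the data page with a later LSN */
    XLogRecPtr lock_tuple = model_xlog_insert("heap_lock_tuple record");

    model_flush_data_page(lock_tuple);      /* page write forces both records out */
    printf("create record durable? %s\n",
           wal_flushed_upto >= create_mxid ? "yes" : "no");
    return 0;
}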

While contemplating the "stalls ... just after calling GetNewMultiXactId()"
scenario, I noticed a race condition not involving any write-WAL-before-data
violation. MultiXactIdCreateFromMembers() and callees have this skeleton:

GetNewMultiXactId
    LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
    ExtendMultiXactOffset()
    ExtendMultiXactMember()
    START_CRIT_SECTION()
    (MultiXactState->nextMXact)++
    MultiXactState->nextOffset += nmembers
    LWLockRelease(MultiXactGenLock)
XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
RecordNewMultiXact
    LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
    *offptr = offset;               /* update pg_multixact/offsets SLRU page */
    LWLockRelease(MultiXactOffsetControlLock)
    LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
    *memberptr = members[i].xid;    /* update pg_multixact/members SLRU page */
    *flagsptr = flagsval;           /* update pg_multixact/members SLRU page */
    LWLockRelease(MultiXactMemberControlLock)
END_CRIT_SECTION

Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
create higher-numbered mxids. They will skip over SLRU space the current
backend reserved. Future GetMultiXactIdMembers() calls rely critically on the
current backend eventually finishing:

* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
* the offset entry for that multixact exists (because GetNewMultiXactId
* won't release MultiXactGenLock until it does) but contains zero
* (because we are careful to pre-zero offset pages). Because
* GetNewMultiXactId will never return zero as the starting offset for a
* multixact, when we read zero as the next multixact's offset, we know we
* have this case. We sleep for a bit and try again.
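
As a simplified model of that sleep-and-retry behavior (mine, not the real
GetMultiXactIdMembers() code; the names are invented), a reader that keeps
seeing a zero next-offset just keeps waiting, which is why a permanently-zero
offset turns into an indefinite hang:

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

typedef uint32_t MultiXactOffset;

/* stand-in for reading pg_multixact/offsets for the *next* mxid; after the
 * crash described below it stays zero forever */
static MultiXactOffset read_next_offset(void)
{
    return 0;
}

int main(void)
{
    /* capped at 5 tries so this model terminates; the real loop has no bound */
    for (int tries = 0; tries < 5; tries++)
    {
        MultiXactOffset nextOffset = read_next_offset();

        if (nextOffset != 0)
        {
            printf("creator finished; our mxid's length is now known\n");
            return 0;
        }
        printf("next offset still zero; sleeping and retrying...\n");
        sleep(1);               /* the real code sleeps briefly between retries */
    }
    printf("still zero; with a permanently-zero offset this never resolves\n");
    return 0;
}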

A crash while paused this way makes the zero offset permanent. After
recovery, even though no xmax carries the offset-zero mxid,
GetMultiXactIdMembers() on the immediately preceding mxid hangs indefinitely.
Also, pg_get_multixact_members(zero_offset_mxid) hits an assertion failure. I
have not thoroughly considered how best to fix these. Test procedure:

-- backend 1
checkpoint;
create table victim0 (c) as select 1;
create table stall1 (c) as select 1;
create table last2 (c) as select 1;
begin;
select c from victim0 for key share;
select c from stall1 for key share;
select c from last2 for key share;

-- backend 2
begin; update victim0 set c = c + 1; rollback; -- burn one mxid
-- In a shell, attach GDB to backend 2. GDB will stop the next SQL command at
-- XLogInsert() in MultiXactIdCreateFromMembers():
-- gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c
select c from stall1 for key share; -- stall in gdb while making an mxid

-- backend 3
select c from last2 for key share; -- use one more mxid; flush WAL

-- in GDB session, issue command: kill

-- backend 1, after recovery
select c from victim0; -- hangs, uncancelable

-- backend 2, after recovery: assertion failure
select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;

Attachment Content-Type Size
doc-mxact-wal-before-data-v1.patch text/plain 3.9 KB
