Re: Rework the way multixact truncations work

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rework the way multixact truncations work
Date: 2015-12-02 15:46:26
Message-ID: 20151202154626.GH5136@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2015-12-02 09:57:19 -0500, Noah Misch wrote:
> On Tue, Dec 01, 2015 at 05:07:15PM -0500, Robert Haas wrote:
> > I think it's weird to back a commit out only to put a bunch of very similar
> > stuff back in.
>
> I agree with that. If the original patches and their replacements shared 95%
> of diff lines in common, we wouldn't be having this conversation. These
> replacements redo closer to 50% of the lines, so the patches are not very
> similar by verbatim line comparison.

Which is a huge problem, because it makes it very hard to see what your
changes actually do. And a significant portion of the changes relative
to master aren't particularly critical. Which is easy to see if if a
commit only changes comments, but harder if you see one commit reverting
things, and another redoing most of the same things.

> Hackers have been too reticent to revert and redo defect-laden
> commits. If doing that is weird today, let it be normal.

Why? Especially if reverting and redoing includes conflicts that mainly
increases the chance of accidental bugs.

> git remote add community git://git.postgresql.org/git/postgresql.git
> git remote add nmisch_github https://github.com/nmisch/postgresql.git
> git fetch --multiple community nmisch_github
> git diff community/master...nmisch_github/mxt4-rm-legacy

That's a nearly useless diff, because it includes a lot of other changes
(218 files changed, 2828 insertions(+), 8742 deletions(-)) made since
you published the changes. What kinda works is
git diff $(git merge-base community/master nmisch_github/mxt4-rm-legacy)..nmisch_github/mxt4-rm-legacy
which shows the diff to the version of master you start off from.

Review of the above diff:
> @@ -2013,7 +2017,7 @@ TrimMultiXact(void)
> {
> MultiXactId nextMXact;
> MultiXactOffset offset;
> - MultiXactId oldestMXact;
> + MultiXactId oldestMXact;

That's a bit weird, given that nextMXact isn't indented...

> @@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
> - /*
> - * Computing the actual limits is only possible once the data directory is
> - * in a consistent state. There's no need to compute the limits while
> - * still replaying WAL - no decisions about new multis are made even
> - * though multixact creations might be replayed. So we'll only do further
> - * checks after TrimMultiXact() has been called.
> - */
> + /* Before the TrimMultiXact() call at end of recovery, skip the rest. */
> if (!MultiXactState->finishedStartup)
> return;
> -
> Assert(!InRecovery);
>
> - /* Set limits for offset vacuum. */
> + /*
> + * Setting MultiXactState->oldestOffset entails a find_multixact_start()
> + * call, which is only possible once the data directory is in a consistent
> + * state. There's no need for an offset limit while still replaying WAL;
> + * no decisions about new multis are made even though multixact creations
> + * might be replayed.
> + */
> needs_offset_vacuum = SetOffsetVacuumLimit();

I don't really see the benefit of this change. The previous location of
the comment is where we return early, so it seems appropriate to
document the reason there?

> /*
> @@ -2354,6 +2356,12 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
> debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
> MultiXactState->nextMXact = minMulti;
> }
> +
> + /*
> + * MultiXactOffsetPrecedes() gives the wrong answer if nextOffset would
> + * advance more than 2^31 between calls. Since we get a call for each
> + * XLOG_MULTIXACT_CREATE_ID, that should never happen.
> + */

Independent comment improvement. Good idea though.

> /*
> - * Update our oldestMultiXactId value, but only if it's more recent than what
> - * we had.
> - *
> - * This may only be called during WAL replay.
> + * Update our oldestMultiXactId value, but only if it's more recent than
> + * what we had. This may only be called during WAL replay.
> */

Whatever?

> void
> MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
> @@ -2544,14 +2550,13 @@ GetOldestMultiXactId(void)
> static bool
> SetOffsetVacuumLimit(void)
> {
> - MultiXactId oldestMultiXactId;
> + MultiXactId oldestMultiXactId;
> MultiXactId nextMXact;
> - MultiXactOffset oldestOffset = 0; /* placate compiler */
> - MultiXactOffset prevOldestOffset;
> + MultiXactOffset oldestOffset = 0; /* placate compiler */
> MultiXactOffset nextOffset;
> bool oldestOffsetKnown = false;
> + MultiXactOffset prevOldestOffset;
> bool prevOldestOffsetKnown;
> - MultiXactOffset offsetStopLimit = 0;

I don't see the benefit of the order changes here.

> @@ -2588,40 +2590,50 @@ SetOffsetVacuumLimit(void)
> else
> {
> /*
> - * Figure out where the oldest existing multixact's offsets are
> - * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
> - * the supposedly-earliest multixact might not really exist. We are
> + * Figure out where the oldest existing multixact's offsets are stored.
> + * Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X, the
> + * supposedly-earliest multixact might not really exist. We are
> * careful not to fail in that case.
> */
> oldestOffsetKnown =
> find_multixact_start(oldestMultiXactId, &oldestOffset);
> -
> - if (oldestOffsetKnown)
> - ereport(DEBUG1,
> - (errmsg("oldest MultiXactId member is at offset %u",
> - oldestOffset)));

That's imo a rather useful debug message.

> - else
> + if (!oldestOffsetKnown)
> + {
> + /* XXX This message is incorrect if prevOldestOffsetKnown. */
> ereport(LOG,
> (errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
> oldestMultiXactId)));
> + }
> }

Hm, the XXX is a "problem" in all 9.3+ - should we just fix it everywhere?

> LWLockRelease(MultiXactTruncationLock);
>
> /*
> - * If we can, compute limits (and install them MultiXactState) to prevent
> - * overrun of old data in the members SLRU area. We can only do so if the
> - * oldest offset is known though.
> + * There's no need to update anything if we don't know the oldest offset
> + * or if it hasn't changed.
> */

Is that really a worthwhile optimization?

> -typedef struct mxtruncinfo
> -{
> - int earliestExistingPage;
> -} mxtruncinfo;
> -
> -/*
> - * SlruScanDirectory callback
> - * This callback determines the earliest existing page number.
> - */
> -static bool
> -SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
> -{
> - mxtruncinfo *trunc = (mxtruncinfo *) data;
> -
> - if (trunc->earliestExistingPage == -1 ||
> - ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
> - {
> - trunc->earliestExistingPage = segpage;
> - }
> -
> - return false; /* keep going */
> -}
> -

That really seems like an independent change, deserving its own commit +
explanation. Just referring to "See mailing list submission notes for
rationale." makes understanding the change later imo much harder than
all the incremental commits you try to avoid.

> /*
> - * Decide which of two MultiXactMember page numbers is "older" for truncation
> - * purposes. There is no "invalid offset number" so use the numbers verbatim.
> + * Dummy notion of which of two MultiXactMember page numbers is "older".
> + *
> + * Due to the MultiXactOffsetPrecedes() specification, this function's result
> + * is meaningless unless the system is preserving less than 2^31 members. It
> + * is adequate for SlruSelectLRUPage() guessing the cheapest slot to reclaim.
> + * Do not pass MultiXactMemberCtl to any of the functions that use the
> + * PagePrecedes callback in other ways.
> */
> static bool
> MultiXactMemberPagePrecedes(int page1, int page2)
> @@ -3157,6 +3134,10 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
>
> /*
> * Decide which of two offsets is earlier.
> + *
> + * Avoid calling this function. pg_multixact/members can preserve almost 2^32
> + * members at any given time, but this function is transitive only when the
> + * system is preserving less than 2^31 members.
> */
> static bool
> MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)

As mentioned before, these really seem unrelated.

> @@ -1237,24 +1235,25 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
> SlruShared shared = ctl->shared;
> int slotno;
> char path[MAXPGPATH];
> - bool did_write;
>
> /* Clean out any possibly existing references to the segment. */
> LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
> restart:
> - did_write = false;
> for (slotno = 0; slotno < shared->num_slots; slotno++)
> {
> - int pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
> + int pagesegno;
>
> if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
> continue;
>
> /* not the segment we're looking for */
> + pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
> if (pagesegno != segno)
> continue;
>
> - /* If page is clean, just change state to EMPTY (expected case). */
> + /*
> + * If page is clean, just change state to EMPTY (expected case).
> + */
> if (shared->page_status[slotno] == SLRU_PAGE_VALID &&
> !shared->page_dirty[slotno])
> {
> @@ -1267,18 +1266,10 @@ restart:
> SlruInternalWritePage(ctl, slotno, NULL);
> else
> SimpleLruWaitIO(ctl, slotno);
> -
> - did_write = true;
> - }
> -
> - /*
> - * Be extra careful and re-check. The IO functions release the control
> - * lock, so new pages could have been read in.
> - */
> - if (did_write)
> goto restart;
> + }

I don't think that's really a good idea - this way we restart after
every single page write, whereas currently we only restart after passing
through all pages once. In nearly all cases we'll only ever have to
retry once in total, be because such old pages aren't usually going to
be reread/redirtied.

> @@ -9216,10 +9212,8 @@ xlog_redo(XLogReaderState *record)
> LWLockRelease(OidGenLock);
> MultiXactSetNextMXact(checkPoint.nextMulti,
> checkPoint.nextMultiOffset);
> -
> - MultiXactAdvanceOldest(checkPoint.oldestMulti,
> - checkPoint.oldestMultiDB);
> SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
> + SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
>
> /*
> * If we see a shutdown checkpoint while waiting for an end-of-backup
> @@ -9309,17 +9303,13 @@ xlog_redo(XLogReaderState *record)
> LWLockRelease(OidGenLock);
> MultiXactAdvanceNextMXact(checkPoint.nextMulti,
> checkPoint.nextMultiOffset);
> -
> - /*
> - * NB: This may perform multixact truncation when replaying WAL
> - * generated by an older primary.
> - */
> - MultiXactAdvanceOldest(checkPoint.oldestMulti,
> - checkPoint.oldestMultiDB);
> if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
> checkPoint.oldestXid))
> SetTransactionIdLimit(checkPoint.oldestXid,
> checkPoint.oldestXidDB);
> + MultiXactAdvanceOldest(checkPoint.oldestMulti,
> + checkPoint.oldestMultiDB);
> +
> /* ControlFile->checkPointCopy always tracks the latest ckpt XID */
> ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
> ControlFile->checkPointCopy.nextXid =
> checkPoint.nextXid;

Why?

> diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
> index 7c4ef58..e2b4f4c 100644
> --- a/src/backend/commands/vacuum.c
> +++ b/src/backend/commands/vacuum.c
> @@ -1136,9 +1136,6 @@ vac_truncate_clog(TransactionId frozenXID,
> if (bogus)
> return;
>
> - /*
> - * Truncate CLOG, multixact and CommitTs to the oldest computed value.
> - */
> TruncateCLOG(frozenXID);
> TruncateCommitTs(frozenXID);
> TruncateMultiXact(minMulti, minmulti_datoid);

Why? Sure, it's not a super important comment, but ...?

> diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
> index 0dc4117..41e51cf 100644
> --- a/src/backend/storage/ipc/procarray.c
> +++ b/src/backend/storage/ipc/procarray.c
> @@ -2192,10 +2192,8 @@ GetOldestSafeDecodingTransactionId(void)
>
> /*
> * GetVirtualXIDsDelayingChkpt -- Get the VXIDs of transactions that are
> - * delaying checkpoint because they have critical actions in progress.
> - *
> - * Constructs an array of VXIDs of transactions that are currently in commit
> - * critical sections, as shown by having delayChkpt set in their PGXACT.
> + * delaying checkpoint because they have critical actions in progress, as
> + * shown by having delayChkpt set in their PGXACT.
>
> * Returns a palloc'd array that should be freed by the caller.
> * *nvxids is the number of valid entries.
> @@ -2204,8 +2202,8 @@ GetOldestSafeDecodingTransactionId(void)
> * the result is somewhat indeterminate, but we don't really care. Even in
> * a multiprocessor with delayed writes to shared memory, it should be certain
> * that setting of delayChkpt will propagate to shared memory when the backend
> - * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if
> - * it's already inserted its commit record. Whether it takes a little while
> + * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if it's
> + * already inserted its critical xlog record. Whether it takes a little while
> * for clearing of delayChkpt to propagate is unimportant for correctness.
> */

Seems unrelated, given that this is already used in
MarkBufferDirtyHint(). Don't get me wrong, I think the changes are a
good idea, but it's not really tied to the truncation changes.

- Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2015-12-02 16:07:58 psql ignores failure to open -o target file
Previous Message Robert Haas 2015-12-02 15:29:11 Re: Speed up Clog Access by increasing CLOG buffers