Re: Avoiding smgrimmedsync() during nbtree index builds

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2021-12-09 19:33:51
Message-ID: 20211209193351.e5lzpvxgettg5bx6@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2021-11-19 15:11:57 -0500, Melanie Plageman wrote:
> From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Thu, 15 Apr 2021 07:01:01 -0400
> Subject: [PATCH v2] Index build avoids immed fsync
>
> Avoid immediate fsync for just built indexes either by using shared
> buffers or by leveraging checkpointer's SyncRequest queue. When a
> checkpoint begins during the index build, if not using shared buffers,
> the backend will have to do its own fsync.
> ---
> src/backend/access/nbtree/nbtree.c | 39 +++---
> src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
> src/backend/access/transam/xlog.c | 14 +++
> src/include/access/xlog.h | 1 +
> 4 files changed, 188 insertions(+), 52 deletions(-)
>
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index 40ad0956e0..a2e32f18e6 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -150,30 +150,29 @@ void
> btbuildempty(Relation index)
> {
> Page metapage;
> + Buffer metabuf;
>
> - /* Construct metapage. */
> - metapage = (Page) palloc(BLCKSZ);
> - _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
> -
> + // TODO: test this.

Shouldn't this path have plenty of coverage already?

> /*
> - * Write the page and log it. It might seem that an immediate sync would
> - * be sufficient to guarantee that the file exists on disk, but recovery
> - * itself might remove it while replaying, for example, an
> - * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
> - * this even when wal_level=minimal.
> + * Construct metapage.
> + * Because we don't need to lock the relation for extension (since
> + * noone knows about it yet) and we don't need to initialize the
> + * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
> + * (with P_NEW and BT_WRITE) is overkill.

Isn't the more relevant operation the log_newpage_buffer()?

> However, it might be worth
> + * either modifying it or adding a new helper function instead of
> + * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
> + * because we want to hold an exclusive lock on the buffer content
> */

"modifying it" refers to what?

I don't see a problem using ReadBufferExtended() here. Why would you like to
replace it with something else?

> + /*
> + * Based on the number of tuples, either create a buffered or unbuffered
> + * write state. if the number of tuples is small, make a buffered write
> + * if the number of tuples is larger, then we make an unbuffered write state
> + * and must ensure that we check the redo pointer to know whether or not we
> + * need to fsync ourselves
> + */
>
> /*
> * Finish the build by (1) completing the sort of the spool file, (2)
> * inserting the sorted tuples into btree pages and (3) building the upper
> * levels. Finally, it may also be necessary to end use of parallelism.
> */
> - _bt_leafbuild(buildstate.spool, buildstate.spool2);
> + if (reltuples > 1000)

I'm ok with some random magic constant here, but it seems worth putting it in
a named constant / #define to make it more obvious.

> + _bt_leafbuild(buildstate.spool, buildstate.spool2, false);
> + else
> + _bt_leafbuild(buildstate.spool, buildstate.spool2, true);

Why duplicate the function call?

> /*
> * allocate workspace for a new, clean btree page, not linked to any siblings.
> + * If index is not built in shared buffers, buf should be InvalidBuffer
> */
> static Page
> -_bt_blnewpage(uint32 level)
> +_bt_blnewpage(uint32 level, Buffer buf)
> {
> Page page;
> BTPageOpaque opaque;
>
> - page = (Page) palloc(BLCKSZ);
> + if (buf)
> + page = BufferGetPage(buf);
> + else
> + page = (Page) palloc(BLCKSZ);
>
> /* Zero the page and set up standard page header info */
> _bt_pageinit(page, BLCKSZ);

Ick, that seems pretty ugly API-wise and consequently too likely to lead to
pfree()ing a page that's actually in shared buffers. I think it'd make sense
to separate the allocation from the initialization bits?

> @@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
> * emit a completed btree page, and release the working storage.
> */
> static void
> -_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> +_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
> {
> + if (wstate->use_shared_buffers)
> + {
> + Assert(buf);
> + START_CRIT_SECTION();
> + MarkBufferDirty(buf);
> + if (wstate->btws_use_wal)
> + log_newpage_buffer(buf, true);
> + END_CRIT_SECTION();
> + _bt_relbuf(wstate->index, buf);
> + return;
> + }
> +
> /* XLOG stuff */
> if (wstate->btws_use_wal)
> {
> @@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> wstate->btws_pages_written++,
> (char *) wstate->btws_zeropage,
> - true);
> + false);
> }

Is there a good place to document the way we ensure durability for this path?

> + /*
> + * Extend the index relation upfront to reserve the metapage
> + */
> + if (wstate->use_shared_buffers)
> + {
> + /*
> + * We should not need to LockRelationForExtension() as no one else knows
> + * about this index yet?
> + * Extend the index relation by one block for the metapage. _bt_getbuf()
> + * is not used here as it does _bt_pageinit() which is one later by

*done

> + * _bt_initmetapage(). We will fill in the metapage and write it out at
> + * the end of index build when we have all of the information required
> + * for the metapage. However, we initially extend the relation for it to
> + * occupy block 0 because it is much easier when using shared buffers to
> + * extend the relation with a block number that is always increasing by
> + * 1.

Not quite following what you're trying to get at here. There generally is no
way to extend a relation except by increasing block numbers?

> @@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
> * still not be on disk when the crash occurs.
> */
> if (wstate->btws_use_wal)
> - smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> + {
> + if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
> + smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> + }
> }
>
> /*

This needs documentation. The whole comment above isn't accurate anymore afaict?

> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 1616448368..63fd212787 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
> return RedoRecPtr;
> }
>
> +bool
> +RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{
> + XLogRecPtr ptr;
> +
> + SpinLockAcquire(&XLogCtl->info_lck);
> + ptr = XLogCtl->RedoRecPtr;
> + SpinLockRelease(&XLogCtl->info_lck);
> +
> + if (RedoRecPtr < ptr)
> + RedoRecPtr = ptr;
> + return RedoRecPtr != comparator_ptr;
> +}

What's the deal with the < comparison?

Greetings,

Andres Freund
