Re: Avoiding smgrimmedsync() during nbtree index builds

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2022-01-10 22:50:40
Message-ID: CAAKRu_ZBLVJ3S04YAOcbmzLtUPvbB+G62QDOUjojvV_My4hUOA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have attached a v3 which includes two commits -- one of which
implements the directmgr API and uses it and the other which adds
functionality to use either directmgr or bufmgr API during index build.

Also registering for march commitfest.

On Thu, Dec 9, 2021 at 2:33 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2021-11-19 15:11:57 -0500, Melanie Plageman wrote:
> > From 2130175c5d794f60c5f15d6eb1b626c81fb7c39b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> > Date: Thu, 15 Apr 2021 07:01:01 -0400
> > Subject: [PATCH v2] Index build avoids immed fsync
> >
> > Avoid immediate fsync for just built indexes either by using shared
> > buffers or by leveraging checkpointer's SyncRequest queue. When a
> > checkpoint begins during the index build, if not using shared buffers,
> > the backend will have to do its own fsync.
> > ---
> > src/backend/access/nbtree/nbtree.c | 39 +++---
> > src/backend/access/nbtree/nbtsort.c | 186 +++++++++++++++++++++++-----
> > src/backend/access/transam/xlog.c | 14 +++
> > src/include/access/xlog.h | 1 +
> > 4 files changed, 188 insertions(+), 52 deletions(-)
> >
> > diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> > index 40ad0956e0..a2e32f18e6 100644
> > --- a/src/backend/access/nbtree/nbtree.c
> > +++ b/src/backend/access/nbtree/nbtree.c
> > @@ -150,30 +150,29 @@ void
> > btbuildempty(Relation index)
> > {
> > Page metapage;
> > + Buffer metabuf;
> >
> > - /* Construct metapage. */
> > - metapage = (Page) palloc(BLCKSZ);
> > - _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
> > -
> > + // TODO: test this.
>
> Shouldn't this path have plenty coverage?

Yep. Sorry.

> > /*
> > - * Write the page and log it. It might seem that an immediate sync would
> > - * be sufficient to guarantee that the file exists on disk, but recovery
> > - * itself might remove it while replaying, for example, an
> > - * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
> > - * this even when wal_level=minimal.
> > + * Construct metapage.
> > + * Because we don't need to lock the relation for extension (since
> > + * noone knows about it yet) and we don't need to initialize the
> > + * new page, as it is done below by _bt_blnewpage(), _bt_getbuf()
> > + * (with P_NEW and BT_WRITE) is overkill.
>
> Isn't the more relevant operation the log_newpage_buffer()?

Returning to this after some time away, many of my comments no longer
make sense to me either. I can't actually tell which diff your question
applies to because this comment was copy-pasted in two different places
in my code. Also, I've removed this comment and added new ones. So,
given all that, is there still something about log_newpage_buffer() I
should be commenting on?

> > However, it might be worth
> > + * either modifying it or adding a new helper function instead of
> > + * calling ReadBufferExtended() directly. We pass mode RBM_ZERO_AND_LOCK
> > + * because we want to hold an exclusive lock on the buffer content
> > */
>
> "modifying it" refers to what?
>
> I don't see a problem using ReadBufferExtended() here. Why would you like to
> replace it with something else?

I would just disregard these comments now.

> > + /*
> > + * Based on the number of tuples, either create a buffered or unbuffered
> > + * write state. if the number of tuples is small, make a buffered write
> > + * if the number of tuples is larger, then we make an unbuffered write state
> > + * and must ensure that we check the redo pointer to know whether or not we
> > + * need to fsync ourselves
> > + */
> >
> > /*
> > * Finish the build by (1) completing the sort of the spool file, (2)
> > * inserting the sorted tuples into btree pages and (3) building the upper
> > * levels. Finally, it may also be necessary to end use of parallelism.
> > */
> > - _bt_leafbuild(buildstate.spool, buildstate.spool2);
> > + if (reltuples > 1000)
>
> I'm ok with some random magic constant here, but it seems worht putting it in
> some constant / #define to make it more obvious.

Done.

> > + _bt_leafbuild(buildstate.spool, buildstate.spool2, false);
> > + else
> > + _bt_leafbuild(buildstate.spool, buildstate.spool2, true);
>
> Why duplicate the function call?

Fixed.

> > /*
> > * allocate workspace for a new, clean btree page, not linked to any siblings.
> > + * If index is not built in shared buffers, buf should be InvalidBuffer
> > */
> > static Page
> > -_bt_blnewpage(uint32 level)
> > +_bt_blnewpage(uint32 level, Buffer buf)
> > {
> > Page page;
> > BTPageOpaque opaque;
> >
> > - page = (Page) palloc(BLCKSZ);
> > + if (buf)
> > + page = BufferGetPage(buf);
> > + else
> > + page = (Page) palloc(BLCKSZ);
> >
> > /* Zero the page and set up standard page header info */
> > _bt_pageinit(page, BLCKSZ);
>
> Ick, that seems pretty ugly API-wise and subsequently too likely to lead to
> pfree()ing a page that's actually in shared buffers. I think it'd make sense
> to separate the allocation from the initialization bits?

Fixed.

> > @@ -635,8 +657,20 @@ _bt_blnewpage(uint32 level)
> > * emit a completed btree page, and release the working storage.
> > */
> > static void
> > -_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> > +_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf)
> > {
> > + if (wstate->use_shared_buffers)
> > + {
> > + Assert(buf);
> > + START_CRIT_SECTION();
> > + MarkBufferDirty(buf);
> > + if (wstate->btws_use_wal)
> > + log_newpage_buffer(buf, true);
> > + END_CRIT_SECTION();
> > + _bt_relbuf(wstate->index, buf);
> > + return;
> > + }
> > +
> > /* XLOG stuff */
> > if (wstate->btws_use_wal)
> > {
> > @@ -659,7 +693,7 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> > smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> > wstate->btws_pages_written++,
> > (char *) wstate->btws_zeropage,
> > - true);
> > + false);
> > }
>
> Is there a good place to document the way we ensure durability for this path?

I added some new comments. Let me know if you think that I am still
missing this documentation.

> > + /*
> > + * Extend the index relation upfront to reserve the metapage
> > + */
> > + if (wstate->use_shared_buffers)
> > + {
> > + /*
> > + * We should not need to LockRelationForExtension() as no one else knows
> > + * about this index yet?
> > + * Extend the index relation by one block for the metapage. _bt_getbuf()
> > + * is not used here as it does _bt_pageinit() which is one later by
>
> *done
>
>
> > + * _bt_initmetapage(). We will fill in the metapage and write it out at
> > + * the end of index build when we have all of the information required
> > + * for the metapage. However, we initially extend the relation for it to
> > + * occupy block 0 because it is much easier when using shared buffers to
> > + * extend the relation with a block number that is always increasing by
> > + * 1.
>
> Not quite following what you're trying to get at here. There generally is no
> way to extend a relation except by increasing block numbers?

I've updated this comment too. It should make more sense now.

> > @@ -1425,7 +1544,10 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
> > * still not be on disk when the crash occurs.
> > */
> > if (wstate->btws_use_wal)
> > - smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> > + {
> > + if (!wstate->use_shared_buffers && RedoRecPtrChanged(wstate->redo))
> > + smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> > + }
> > }
> >
> > /*
>
> This needs documentation. The whole comment above isn't accurate anymore afaict?

Should be correct now.

> > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > index 1616448368..63fd212787 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -8704,6 +8704,20 @@ GetRedoRecPtr(void)
> > return RedoRecPtr;
> > }
> >
> > +bool
> > +RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> > +{
> > + XLogRecPtr ptr;
> > +
> > + SpinLockAcquire(&XLogCtl->info_lck);
> > + ptr = XLogCtl->RedoRecPtr;
> > + SpinLockRelease(&XLogCtl->info_lck);
> > +
> > + if (RedoRecPtr < ptr)
> > + RedoRecPtr = ptr;
> > + return RedoRecPtr != comparator_ptr;
> > +}
>
> What's the deal with the < comparison?

I saw that GetRedoRecPtr() does this and thought maybe I should do the
same in this function. I'm not quite sure where I should be getting the
redo pointer.

Maybe I should just call GetRedoRecPtr() and compare it to the one I
saved? I suppose I also thought that maybe someone else in the future
would like to have a function like RedoRecPtrChanged().

- Melanie

Attachment Content-Type Size
v3-0001-Add-unbuffered-IO-and-avoid-immed-fsync.patch text/x-patch 27.2 KB
v3-0002-Use-shared-buffers-when-possible-for-index-build.patch text/x-patch 19.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhihong Yu 2022-01-10 23:34:27 Re: null iv parameter passed to combo_init()
Previous Message Andrew Dunstan 2022-01-10 22:34:56 Re: CREATEROLE and role ownership hierarchies