Re: Avoiding smgrimmedsync() during nbtree index builds

From: Andres Freund <andres(at)anarazel(dot)de>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2022-03-10 18:32:51
Message-ID: 20220310183251.3h4qokwp5leit34j@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

> From a06407b19c8d168ea966e45c0e483b38d49ddc12 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Fri, 4 Mar 2022 14:48:39 -0500
> Subject: [PATCH v6 1/4] Add unbuffered IO API

I think this or one of the following patches should update src/backend/access/transam/README.

> @@ -164,6 +164,16 @@ void
> blbuildempty(Relation index)
> {
> Page metapage;
> + UnBufferedWriteState wstate;
> +
> + /*
> + * Though this is an unlogged relation, pass do_wal=true since the init
> + * fork of an unlogged index must be wal-logged and fsync'd. This currently
> + * has no effect, as unbuffered_write() does not do log_newpage()
> + * internally. However, were this to be replaced with unbuffered_extend(),
> + * do_wal must be true to ensure the data is logged and fsync'd.
> + */
> + unbuffered_prep(&wstate, true);

Wonder if unbuffered_write should have an assert checking that writes to
INIT_FORKNUM are always durable? Looks like it's pretty easy to forget...
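
Roughly something like this, reusing the patch's do_wal field (just a sketch):

    void
    unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
                     ForkNumber forknum, BlockNumber blocknum, Page page)
    {
        /* init fork contents must be durable regardless of relation persistence */
        Assert(forknum != INIT_FORKNUM || wstate->do_wal);

        PageSetChecksumInplace(page, blocknum);

        smgrwrite(smgrrel, forknum, blocknum, (char *) page, true);
    }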

I'd choose unbuffered_begin over _prep().

> /* Construct metapage. */
> metapage = (Page) palloc(BLCKSZ);
> @@ -176,18 +186,13 @@ blbuildempty(Relation index)
> * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
> * this even when wal_level=minimal.
> */
> - PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
> - smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> - (char *) metapage, true);
> + unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> + BLOOM_METAPAGE_BLKNO, metapage);
> +
> log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
> BLOOM_METAPAGE_BLKNO, metapage, true);
>
> - /*
> - * An immediate sync is required even if we xlog'd the page, because the
> - * write did not go through shared_buffers and therefore a concurrent
> - * checkpoint may have moved the redo pointer past our xlog record.
> - */
> - smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
> + unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
> }

I mildly prefer complete over finish, but ...

> - * We can't use the normal heap_insert function to insert into the new
> - * heap, because heap_insert overwrites the visibility information.
> - * We use a special-purpose raw_heap_insert function instead, which
> - * is optimized for bulk inserting a lot of tuples, knowing that we have
> - * exclusive access to the heap. raw_heap_insert builds new pages in
> - * local storage. When a page is full, or at the end of the process,
> - * we insert it to WAL as a single record and then write it to disk
> - * directly through smgr. Note, however, that any data sent to the new
> - * heap's TOAST table will go through the normal bufmgr.
> + * We can't use the normal heap_insert function to insert into the new heap,
> + * because heap_insert overwrites the visibility information. We use a
> + * special-purpose raw_heap_insert function instead, which is optimized for
> + * bulk inserting a lot of tuples, knowing that we have exclusive access to the
> + * heap. raw_heap_insert builds new pages in local storage. When a page is
> + * full, or at the end of the process, we insert it to WAL as a single record
> + * and then write it to disk directly through directmgr. Note, however, that
> + * any data sent to the new heap's TOAST table will go through the normal
> + * bufmgr.

Part of this just reflows existing lines that seem otherwise unchanged, making
it harder to see the actual change.

> @@ -643,13 +644,6 @@ _bt_blnewpage(uint32 level)
> static void
> _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> {
> - /* XLOG stuff */
> - if (wstate->btws_use_wal)
> - {
> - /* We use the XLOG_FPI record type for this */
> - log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
> - }
> -
> /*
> * If we have to write pages nonsequentially, fill in the space with
> * zeroes until we come back and overwrite. This is not logically
> @@ -661,33 +655,33 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> {
> if (!wstate->btws_zeropage)
> wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
> - /* don't set checksum for all-zero page */
> - smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> - wstate->btws_pages_written++,
> - (char *) wstate->btws_zeropage,
> - true);
> +
> + unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
> + MAIN_FORKNUM, wstate->btws_pages_written++,
> + wstate->btws_zeropage, true);
> }

For a bit I thought the true argument to unbuffered_extend was about
durability or registering fsyncs. Perhaps worth making it a flags argument,
with an enum for the individual flag values?
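
Something like this, maybe (sketch, the flag names are made up):

    typedef enum
    {
        UNBUFFERED_EXTEND_EMPTY = 1 << 0,   /* all-zero page: skip checksum and WAL */
    } UnbufferedExtendFlags;

    unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
                      MAIN_FORKNUM, wstate->btws_pages_written++,
                      wstate->btws_zeropage, UNBUFFERED_EXTEND_EMPTY);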

> diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
> new file mode 100644
> index 0000000000..42c37daa7a
> --- /dev/null
> +++ b/src/backend/storage/direct/directmgr.c
> @@ -0,0 +1,98 @@

Now that the API is called unbuffered, the filename / directory seem
confusing.

> +void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +{
> + wstate->do_wal = do_wal;
> +}
> +
> +void
> +unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation
> + smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, bool
> + empty)
> +{
> + /*
> + * Don't checksum empty pages
> + */
> + if (!empty)
> + PageSetChecksumInplace(page, blocknum);
> +
> + smgrextend(smgrrel, forknum, blocknum, (char *) page, true);
> +
> + /*
> + * Don't WAL-log empty pages
> + */
> + if (!empty && wstate->do_wal)
> + log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
> + blocknum, page, true);
> +}
> +
> +void
> +unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
> + ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages,
> + bool empty, XLogRecPtr custom_lsn)
> +{
> + for (int i = 0; i < num_pages; i++)
> + {
> + Page page = pages[i];
> + BlockNumber blkno = blocknums[i];
> +
> + if (!XLogRecPtrIsInvalid(custom_lsn))
> + PageSetLSN(page, custom_lsn);
> +
> + if (!empty)
> + PageSetChecksumInplace(page, blkno);
> +
> + smgrextend(smgrrel, forknum, blkno, (char *) page, true);
> + }
> +
> + if (!empty && wstate->do_wal)
> + log_newpages(&(smgrrel)->smgr_rnode.node, forknum, num_pages,
> + blocknums, pages, true);
> +}
> +
> +void
> +unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber
> + forknum, BlockNumber blocknum, Page page)
> +{
> + PageSetChecksumInplace(page, blocknum);
> +
> + smgrwrite(smgrrel, forknum, blocknum, (char *) page, true);
> +}

Seems like several of these should have some minimal documentation?
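
Even a short header comment per function would help, e.g. (sketch only):

    /*
     * unbuffered_write
     *      Write out an already-allocated block of a relation fork, bypassing
     *      shared buffers.  The caller is responsible for WAL-logging the page
     *      and must call unbuffered_finish() afterwards to guarantee durability.
     */
    void
    unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
                     ForkNumber forknum, BlockNumber blocknum, Page page)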

> +/*
> + * When writing data outside shared buffers, a concurrent CHECKPOINT can move
> + * the redo pointer past our WAL entries and won't flush our data to disk. If
> + * the database crashes before the data makes it to disk, our WAL won't be
> + * replayed and the data will be lost.
> + * Thus, if a CHECKPOINT begins between unbuffered_prep() and
> + * unbuffered_finish(), the backend must fsync the data itself.
> + */

Hm. The last sentence sounds like this happens conditionally, but it doesn't
at this point.

> +typedef struct UnBufferedWriteState
> +{
> + /*
> + * When writing WAL-logged relation data outside of shared buffers, there
> + * is a risk of a concurrent CHECKPOINT moving the redo pointer past the
> + * data's associated WAL entries. To avoid this, callers in this situation
> + * must fsync the pages they have written themselves. This is necessary
> + * only if the relation is WAL-logged or in special cases such as the init
> + * fork of an unlogged index.
> + */
> + bool do_wal;
> +} UnBufferedWriteState;
> +/*
> + * prototypes for functions in directmgr.c
> + */

There should be a newline between the end of the struct and the comment.

> +extern void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);

In headers we don't put the return type on a separate line :/
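
I.e. it should just be:

    extern void unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);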

> --- a/contrib/bloom/blinsert.c
> +++ b/contrib/bloom/blinsert.c
> @@ -173,7 +173,7 @@ blbuildempty(Relation index)
> * internally. However, were this to be replaced with unbuffered_extend(),
> * do_wal must be true to ensure the data is logged and fsync'd.
> */
> - unbuffered_prep(&wstate, true);
> + unbuffered_prep(&wstate, true, false);

This makes me think this really should be a flag argument...

I also don't like the current name of the parameter; "do_optimization" doesn't
explain much.
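
With a flags argument this could read something like (flag names invented):

    unbuffered_prep(&wstate, UNBUFFERED_DO_WAL | UNBUFFERED_TRACK_REDO_PTR);

which says a lot more at the call site than a pair of bare booleans.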

> +bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{

Newline after the return type; in function definitions the return type goes on
a line of its own.
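
I.e.:

    bool
    RedoRecPtrChanged(XLogRecPtr comparator_ptr)
    {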

> void
> -unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool
> + do_optimization)

See earlier comments about documentation and parameter naming...

> + * These callers can optionally use the following optimization: attempt to
> + * use the sync request queue and fall back to fsync'ing the pages
> + * themselves if the Redo pointer moves between the start and finish of
> + * their write. In order to do this, they must set do_optimization to true
> + * so that the redo pointer is saved before the write begins.
> */

When do we not want this?

> From 17fb22142ade65fdbe8c90889e49d0be60ba45e4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Fri, 4 Mar 2022 15:53:05 -0500
> Subject: [PATCH v6 3/4] BTree index use unbuffered IO optimization
>
> While building a btree index, the backend can avoid fsync'ing all of the
> pages if it uses the optimization introduced in a prior commit.
>
> This can substantially improve performance when many indexes are being
> built during DDL operations.
> ---
> src/backend/access/nbtree/nbtree.c | 2 +-
> src/backend/access/nbtree/nbtsort.c | 6 +++++-
> 2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index 6b78acefbe..fc5cce4603 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -161,7 +161,7 @@ btbuildempty(Relation index)
> * internally. However, were this to be replaced with unbuffered_extend(),
> * do_wal must be true to ensure the data is logged and fsync'd.
> */
> - unbuffered_prep(&wstate, true, false);
> + unbuffered_prep(&wstate, true, true);
>
> /* Construct metapage. */
> metapage = (Page) palloc(BLCKSZ);
> diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
> index d6d0d4b361..f1b9e2e24e 100644
> --- a/src/backend/access/nbtree/nbtsort.c
> +++ b/src/backend/access/nbtree/nbtsort.c
> @@ -1189,7 +1189,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
> int64 tuples_done = 0;
> bool deduplicate;
>
> - unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
> + /*
> + * The fsync optimization done by directmgr is only relevant if
> + * WAL-logging, so pass btws_use_wal for this parameter.
> + */
> + unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal);
>
> deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
> BTGetDeduplicateItems(wstate->index);

Why just here?

> From 377c195bccf2dd2529e64d0d453104485f7662b7 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
> Date: Fri, 4 Mar 2022 15:52:45 -0500
> Subject: [PATCH v6 4/4] Use shared buffers when possible for index build
>
> When there are not too many tuples, building the index in shared buffers
> makes sense. It allows the buffer manager to handle how best to do the
> IO.
> ---

Perhaps it'd be worth making this an independent patch that could be applied
separately?

> src/backend/access/nbtree/nbtree.c | 32 ++--
> src/backend/access/nbtree/nbtsort.c | 273 +++++++++++++++++++++-------
> 2 files changed, 223 insertions(+), 82 deletions(-)
>
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index fc5cce4603..d3982b9388 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -152,34 +152,24 @@ void
> btbuildempty(Relation index)
> {
> Page metapage;
> - UnBufferedWriteState wstate;
> + Buffer metabuf;
>
> /*
> - * Though this is an unlogged relation, pass do_wal=true since the init
> - * fork of an unlogged index must be wal-logged and fsync'd. This currently
> - * has no effect, as unbuffered_write() does not do log_newpage()
> - * internally. However, were this to be replaced with unbuffered_extend(),
> - * do_wal must be true to ensure the data is logged and fsync'd.
> + * Allocate a buffer for metapage and initialize metapage.
> */
> - unbuffered_prep(&wstate, true, true);
> -
> - /* Construct metapage. */
> - metapage = (Page) palloc(BLCKSZ);
> + metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK,
> + NULL);
> + metapage = BufferGetPage(metabuf);
> _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
>
> /*
> - * Write the page and log it. It might seem that an immediate sync would
> - * be sufficient to guarantee that the file exists on disk, but recovery
> - * itself might remove it while replaying, for example, an
> - * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
> - * this even when wal_level=minimal.
> + * Mark metapage buffer as dirty and XLOG it
> */
> - unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> - BTREE_METAPAGE, metapage);
> - log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
> - BTREE_METAPAGE, metapage, true);
> -
> - unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
> + START_CRIT_SECTION();
> + MarkBufferDirty(metabuf);
> + log_newpage_buffer(metabuf, true);
> + END_CRIT_SECTION();
> + _bt_relbuf(index, metabuf);
> }

I don't understand why this patch changes btbuildempty()? That data is never
accessed again, so it doesn't really seem beneficial to shuffle it through
shared buffers, even if we benefit from using s_b for the main fork...

> + /*
> + * If not using shared buffers, for a WAL-logged relation, save the redo
> + * pointer location in case a checkpoint begins during the index build.
> + */
> + if (wstate->_bt_bl_unbuffered_prep)
> + wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
> + wstate->btws_use_wal, wstate->btws_use_wal);

Code would probably look cleaner if an empty callback were used when no
_bt_bl_unbuffered_prep callback is needed.
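
Something like (sketch; signature guessed from the call above):

    static void
    _bt_bl_unbuffered_prep_noop(UnBufferedWriteState *wstate, bool do_wal,
                                bool do_optimization)
    {
    }

and then the call site wouldn't need the if (wstate->_bt_bl_unbuffered_prep)
check at all.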

> /*
> - * allocate workspace for a new, clean btree page, not linked to any siblings.
> + * Set up workspace for a new, clean btree page, not linked to any siblings.
> + * Caller must allocate the passed in page.

The more interesting bit seems to be whether the passed-in page needs to be
initialized?

> @@ -1154,20 +1285,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
> * back one slot. Then we can dump out the page.
> */
> _bt_slideleft(s->btps_page);
> - _bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
> + wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
> + s->btps_buf = InvalidBuffer;
> s->btps_page = NULL; /* writepage freed the workspace */
> }

Do we really have to add underscores even to struct members? That just looks
wrong.

Greetings,

Andres Freund
