Re: Avoiding smgrimmedsync() during nbtree index builds

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2021-05-03 21:24:50
Message-ID: CAAKRu_byoShuRoDchgM29B98mqwH0bnha9k1=Rw+NOvC_F6dzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

So, I've written a patch which avoids doing the immediate fsync for
index builds either by using shared buffers or by queueing sync requests
for the checkpointer. If a checkpoint starts during the index build and
the backend is not using shared buffers for the index build, it will
need to do the fsync.

The reviewer will notice that _bt_load() extends the index relation for
the metapage before beginning the actual load of leaf pages but does not
actually write the metapage until the end of the index build. When using
shared buffers, it was difficult to create block 0 of the index after
creating all of the other blocks, as the block number is assigned inside
of ReadBuffer_common(), and it doesn't really work with the current
bufmgr API to extend a relation with a caller-specified block number.

I am not entirely sure of the correctness of doing an smgrextend() (when
not using shared buffers) without writing any WAL. However, the metapage
contents are not written until after WAL logging them later in
_bt_blwritepage(), so, perhaps it is okay?

I am also not fond of the change to the signature of _bt_uppershutdown()
that this implementation forces. Now, I must pass the shared buffer
(when using shared buffers) that I've reserved (pinned and locked) for
the metapage and, if not using shared buffers, the page I've allocated
for the metapage, before doing the index build to _bt_uppershutdown()
after doing the rest of the index build. I don't know that it seems
incorrect -- more that it feels a bit messy (and inefficient) to hold
onto that shared buffer or memory for the duration of the index build,
during which I have no intention of doing anything with that buffer or
memory. However, the alternative I devised was to change
ReadBuffer_common() or to add a new ReadBufferExtended() mode which
indicated that the caller would specify the block number and whether or
not it was an extend, which also didn't seem right.

For the extensions of the index done during index build, I use
ReadBufferExtended() directly instead of _bt_getbuf() for a few reasons.
I thought (am not sure) that I don't need to do
LockRelationForExtension() during index build. Also, I decided to use
RBM_ZERO_AND_LOCK mode so that I had an exclusive lock on the buffer
content instead of doing _bt_lockbuf() (which is what _bt_getbuf()
does). And, most of the places I added the call to ReadBufferExtended(),
the non-shared buffer code path is already initializing the page, so it
made more sense to just share that codepath.

I considered whether or not it made sense to add a new btree utility
function which calls ReadBufferExtended() in this way, however, I wasn't
sure how much that would buy me. The other place it might be able to be
used is btvacuumpage(), but that case is different enough that I'm not
even sure what the function would be called -- basically it would just
be an alternative to _bt_getbuf() for a couple of somewhat unrelated edge
cases.

On Thu, Jan 21, 2021 at 5:51 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> > On 21/01/2021 22:36, Andres Freund wrote:
> > > A quick hack (probably not quite correct!) to evaluate the benefit shows
> > > that the attached script takes 2m17.223s with the smgrimmedsync and
> > > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> > > the former case, CPU bound in the latter.
> > >
> > > Creating lots of tables with indexes (directly or indirectly through
> > > relations having a toast table) is pretty common, particularly after the
> > > introduction of partitioning.
> > >

Moving index builds of indexes which would fit in shared buffers back
into shared buffers has the benefit of eliminating the need to write
them out and fsync them if they will be subsequently used and thus read
right back into shared buffers. This avoids some of the unnecessary
fsyncs Andres is talking about here as well as avoiding some of the
extra IO required to write them and then read them into shared buffers.

I have dummy criteria for whether or not to use shared buffers (if the
number of tuples to be indexed is > 1000). I am considering using a
threshold of some percentage of the size of shared buffers as the
actual criteria for determining where to do the index build.

> > >
> > > Thinking through the correctness of replacing smgrimmedsync() with sync
> > > requests, the potential problems that I can see are:
> > >
> > > 1) redo point falls between the log_newpage() and the
> > > write()/register_dirty_segment() in smgrextend/smgrwrite.
> > > 2) redo point falls between write() and register_dirty_segment()
> > >
> > > But both of these are fine in the context of initially filling a newly
> > > created relfilenode, as far as I can tell? Otherwise the current
> > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> >
> > Hmm. If the redo point falls between write() and the
> > register_dirty_segment(), and the checkpointer finishes the whole checkpoint
> > before register_dirty_segment(), you are not safe. That can't happen with
> > write from the buffer manager, because the checkpointer would block waiting
> > for the flush of the buffer to finish.
>
> Hm, right.
>
> The easiest way to address that race would be to just record the redo
> pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
> checkpoint started since the start of the index build.
>
> Another approach would be to utilize PGPROC.delayChkpt, but I would
> rather not unnecessarily expand the use of that.
>
> It's kind of interesting - in my aio branch I moved the
> register_dirty_segment() to before the actual asynchronous write (due to
> availability of the necessary data), which ought to be safe because of
> the buffer interlocking. But that doesn't work here, or for other places
> doing writes without going through s_b. It'd be great if we could come
> up with a general solution, but I don't immediately see anything great.
>
> The best I can come up with is adding helper functions to wrap some of
> the complexity for "unbuffered" writes of doing an immedsync iff the
> redo pointer changed. Something very roughly like
>
> typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
> void unbuffered_prep(UnbufferedWriteState* state);
> void unbuffered_write(UnbufferedWriteState* state, ...);
> void unbuffered_extend(UnbufferedWriteState* state, ...);
> void unbuffered_finish(UnbufferedWriteState* state);
>
> which wouldn't just do the dance to avoid the immedsync() if possible,
> but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
> if we get that [1]).
>

Regarding the implementation, I think having an API to do these
"unbuffered" or "direct" writes outside of shared buffers is a good
idea. In this specific case, the proposed API would not change the code
much. I would just wrap the small diffs I added to the beginning and end
of _bt_load() in the API calls for unbuffered_prep() and
unbuffered_finish() and then tuck away the second half of
_bt_blwritepage() in unbuffered_write()/unbuffered_extend(). I figured I
would do so after ensuring the correctness of the logic in this patch.
Then I will work on a patch which implements the unbuffered_write() API
and demonstrates its utility with at least a few of the most compelling
most compelling use cases in the code.

- Melanie

Attachment Content-Type Size
v1-0001-Index-build-avoids-immed-fsync.patch text/x-patch 18.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2021-05-03 22:07:22 Re: MaxOffsetNumber for Table AMs
Previous Message Tom Lane 2021-05-03 21:13:49 Re: PG in container w/ pid namespace is init, process exits cause restart