Re: Avoiding smgrimmedsync() during nbtree index builds

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2021-09-29 18:35:47
Message-ID: CAAKRu_aw72w70X1P=ba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, May 3, 2021 at 5:24 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
> On Thu, Jan 21, 2021 at 5:51 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> > > On 21/01/2021 22:36, Andres Freund wrote:
> > > >
> > > > Thinking through the correctness of replacing smgrimmedsync() with sync
> > > > requests, the potential problems that I can see are:
> > > >
> > > > 1) redo point falls between the log_newpage() and the
> > > > write()/register_dirty_segment() in smgrextend/smgrwrite.
> > > > 2) redo point falls between write() and register_dirty_segment()
> > > >
> > > > But both of these are fine in the context of initially filling a newly
> > > > created relfilenode, as far as I can tell? Otherwise the current
> > > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> > >
> > > Hmm. If the redo point falls between write() and the
> > > register_dirty_segment(), and the checkpointer finishes the whole checkpoint
> > > before register_dirty_segment(), you are not safe. That can't happen with
> > > write from the buffer manager, because the checkpointer would block waiting
> > > for the flush of the buffer to finish.
> >
> > Hm, right.
> >
> > The easiest way to address that race would be to just record the redo
> > pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
> > checkpoint started since the start of the index build.
> >
> > Another approach would be to utilize PGPROC.delayChkpt, but I would
> > rather not unnecessarily expand the use of that.
> >
> > It's kind of interesting - in my aio branch I moved the
> > register_dirty_segment() to before the actual asynchronous write (due to
> > availability of the necessary data), which ought to be safe because of
> > the buffer interlocking. But that doesn't work here, or for other places
> > doing writes without going through s_b. It'd be great if we could come
> > up with a general solution, but I don't immediately see anything great.
> >
> > The best I can come up with is adding helper functions to wrap some of
> > the complexity for "unbuffered" writes of doing an immedsync iff the
> > redo pointer changed. Something very roughly like
> >
> > typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
> > void unbuffered_prep(UnbufferedWriteState* state);
> > void unbuffered_write(UnbufferedWriteState* state, ...);
> > void unbuffered_extend(UnbufferedWriteState* state, ...);
> > void unbuffered_finish(UnbufferedWriteState* state);
> >
> > which wouldn't just do the dance to avoid the immedsync() if possible,
> > but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
> > if we get that [1]).
> >
>
> Regarding the implementation, I think having an API to do these
> "unbuffered" or "direct" writes outside of shared buffers is a good
> idea. In this specific case, the proposed API would not change the code
> much. I would just wrap the small diffs I added to the beginning and end
> of _bt_load() in the API calls for unbuffered_prep() and
> unbuffered_finish() and then tuck away the second half of
> _bt_blwritepage() in unbuffered_write()/unbuffered_extend(). I figured I
> would do so after ensuring the correctness of the logic in this patch.
> Then I will work on a patch which implements the unbuffered_write() API
> and demonstrates its utility with at least a few of the most compelling
> most compelling use cases in the code.
>

I've taken a pass at writing the API for "direct" or "unbuffered" writes
and extends. It introduces the suggested functions: unbuffered_prep(),
unbuffered_finish(), unbuffered_write(), and unbuffered_extend().

This is a rough cut -- corrections welcome and encouraged!

unbuffered_prep() saves the xlog redo pointer at the time it is called.
Then, if the redo pointer hasn't changed by the time unbuffered_finish()
is called, the backend can avoid calling smgrimmedsync(). Note that this
only works if intervening calls to smgrwrite() and smgrextend() pass
skipFsync=False.

unbuffered_write() and unbuffered_extend() might be able to be used even
if unbuffered_prep() and unbuffered_finish() are not used -- for example
hash indexes do something I don't entirely understand in which they call
smgrextend() directly when allocating buckets but then initialize the
new bucket pages using the bufmgr machinery.

I also intend to move accounting of pages written and extended into the
unbuffered_write() and unbuffered_extend() functions using the functions
I propose in [1] to populate a new view, pg_stat_buffers. Then this
"unbuffered" IO would be counted in stats.

I picked the name "direct" for the directory in /src/backend/storage
because I thought that these functions are analogous to direct IO in
Linux -- in that they are doing IO without going through Postgres bufmgr
-- unPGbuffered, basically. Other suggestions were "raw" and "relIO".
Raw seemed confusing since raw device IO is pretty far from what is
happening here. RelIO didn't seem like it belonged next to bufmgr (to
me). However, direct and unbuffered will both soon become fraught
terminology with the introduction of AIO and direct IO to Postgres...

- Melanie

[1] https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de

Attachment Content-Type Size
v1-0001-Add-unbuffered-IO-and-avoid-immed-fsync.patch text/x-patch 25.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-09-29 18:43:53 Re: when the startup process doesn't (logging startup delays)
Previous Message Peter Geoghegan 2021-09-29 18:27:28 Enabling deduplication with system catalog indexes