Re: Avoiding smgrimmedsync() during nbtree index builds

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2022-03-04 22:03:09
Message-ID: CAAKRu_ZQEpk6Q1WtNLgfXBdCmdU5xN_w0boVO6faO_Ax+ckjig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 2, 2022 at 8:09 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
> Rebased to appease cfbot.
>
> I ran these paches under a branch which shows code coverage in cirrus. It
> looks good to my eyes.
> https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html

thanks for doing that and for the rebase! since I made updates, the
attached version 6 is also rebased.

To Dmitry's question:

On Sun, Feb 13, 2022 at 9:33 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>
> > On Wed, Feb 09, 2022 at 01:49:30PM -0500, Melanie Plageman wrote:
>
> I don't see it in the discussion, so naturally curious -- why directmgr
> is not used for bloom index, e.g. in blbuildempty?

thanks for pointing this out. blbuildempty() is also included now. bloom
doesn't seem to use smgr* anywhere except blbuildempty(), so that is the
only place I made changes in bloom index build.

v6 has the following updates/changes:

- removed erroneous extra calls to unbuffered_prep() and
unbuffered_finish() in GiST and btree index builds

- removed unnecessary includes

- some comments were updated to accurately reflect use of directmgr

- smgrwrite doesn't WAL-log anymore. This one I'm not sure about. I
think it makes sense that unbuffered_extend() of non-empty pages of
WAL-logged relations (or the init fork of unlogged relations) do
log_newpage(), but I wasn't so sure about smgrwrite().

Currently all callers of smgrwrite() do log_newpage() and anyone using
mdwrite() will end up writing the whole page. However, it seems
possible that a caller of the directmgr API might wish to do a write
to a particular offset and, either because of that, or, for some other
reason, require different logging than that done in log_newpage().

I didn't want to make a separate wrapper function for WAL-logging in
directmgr because it felt like one more step to forget.

- heapam_relation_set_new_filenode doesn't use directmgr API anymore for
unlogged relations. It doesn't extend or write, so it seemed like a
special case better left alone.

Note that the ambuildempty() functions which also write to the init
fork of an unlogged relation still use the directmgr API. It is a bit
confusing because they pass do_wal=true to unbuffered_prep() even
though they are unlogged relations because they must log and fsync.

- interface changes to unbuffered_prep()
I removed the parameters to unbuffered_prep() which required the user
to specify if fsync should be added to pendingOps or done with
smgrimmedsync(). Understanding all of the combinations of these
parameters and when they were needed was confusing and the interface
felt like a foot gun. Special cases shouldn't use this interface.

I prefer the idea that users of this API expect that
1) empty pages won't be checksummed or WAL logged
2) for relations that are WAL-logged, when the build is done, the
relation will be fsync'd by the backend (unless the fsync optimization
is being used)
3) the only case in which fsync requests are added to the pendingOps
queue is when the fsync optimization is being used (which saves the
redo pointer and check it later to determine if it needs to fsync
itself)

I also added the parameter "do_wal" to unbuffered_prep() and the
UnBufferedWriteState struct. This is used when extending the file to
determine whether or not to log_newpage(). unbuffered_extend() and
unbuffered_write() no longer take do_wal as a parameter.

Note that callers need to pass do_wal=true/false to unbuffered_prep()
based on whether or not they want log_newpage() called during
unbuffered_extend()--not simply based on whether or not the relation
in question is WAL-logged.

do_wal is the only member of the UnBufferedWriteState struct in the
first patch in the set, but I think it makes sense to keep the struct
around since I anticipate that the patch containing the other members
needed for the fsync optimization will be committed at the same time.

One final note on unbuffered_prep() -- I am thinking of renaming
"do_optimization" to "try_optimization" or maybe
"request_fsync_optimization". The interface (of unbuffered_prep())
would be better if we always tried to do the optimization when
relevant (when the relation is WAL-logged).

- freespace map, visimap, and hash index don't use directmgr API anymore
For most use cases writing/extending outside shared buffers, it isn't
safe to rely on requesting fsync from checkpointer.

visimap, fsm, and hash index all request fsync from checkpointer for
the pages they write with smgrextend() and don't smgrimmedsync() when
finished adding pages (even when the hash index is WAL-logged).

Supporting these exceptions made the interface incoherent, so I cut
them.

- added unbuffered_extend_range()
This one is a bit unfortunate. Because GiST index build uses
log_newpages() to log a whole page range but calls smgrextend() for
each of those pages individually, I couldn't use the
unbuffered_extend() function easily.

So, I thought it might be useful in other contexts as well to have a
function which calls smgrextend() for a range of pages and then calls
log_newpages(). I've added this.

However, there are two parts of GiST index build flush ready pages
that didn't work with this either.

The first is that it does an error check on the block numbers one at a
time while looping through them to write the pages. To retain this
check, I loop through the ready pages an extra time before calling
unbuffered_extend(), which is probably not acceptable.

Also, GiST needs to use a custom LSN for the pages. To achieve this, I
added a "custom_lsn" parameter to unbuffered_extend_range(). This
isn't great either. If this was a more common case, I could pass the
custom LSN to unbuffered_prep().

- Melanie

Attachment Content-Type Size
v6-0003-BTree-index-use-unbuffered-IO-optimization.patch text/x-patch 1.9 KB
v6-0001-Add-unbuffered-IO-API.patch text/x-patch 31.2 KB
v6-0004-Use-shared-buffers-when-possible-for-index-build.patch text/x-patch 20.1 KB
v6-0002-Avoid-immediate-fsync-for-unbuffered-IO.patch text/x-patch 9.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-03-04 22:04:44 Re: Regression tests failures on Windows Server 2019 - on master at commit # d816f366b
Previous Message Andres Freund 2022-03-04 21:56:57 Re: Regression tests failures on Windows Server 2019 - on master at commit # d816f366b