Re: Avoiding smgrimmedsync() during nbtree index builds

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2022-02-09 18:49:30
Message-ID: CAAKRu_ZmkCNz-=5U5wZyyi3=Usfs1btsTKc_L9r1ceFvbxE3kg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,
v5 attached and all email feedback addressed below

On Thu, Jan 13, 2022 at 12:18 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Sep 29, 2021 at 2:36 PM Melanie Plageman
> <melanieplageman(at)gmail(dot)com> wrote:
> > unbuffered_write() and unbuffered_extend() might be able to be used even
> > if unbuffered_prep() and unbuffered_finish() are not used -- for example
> > hash indexes do something I don't entirely understand in which they call
> > smgrextend() directly when allocating buckets but then initialize the
> > new bucket pages using the bufmgr machinery.
>
> My first thought was that someone might do this to make sure that we
> don't run out of disk space after initializing some but not all of the
> buckets. Someone might have some reason for wanting to avoid that
> corner case. However, in _hash_init() that explanation doesn't make
> any sense, because an abort would destroy the entire relation. And in
> _hash_alloc_buckets() the variable "zerobuf" points to a buffer that
> is not, in fact, all zeroes. So my guess is this is just old, crufty
> code - either whatever reasons somebody had for doing it that way are
> no longer valid, or there wasn't any good reason even at the time.

I notice in the comment before _hash_alloc_buckets() is called, it says

/*
* We treat allocation of buckets as a separate WAL-logged action.
* Even if we fail after this operation, won't leak bucket pages;
* rather, the next split will consume this space. In any case, even
* without failure we don't use all the space in one split operation.
*/

Does this mean that it is okay that these pages are written outside of
shared buffers and, though skipFsync is passed as false, a checkpoint
starting and finishing between writing the WAL and
register_dirty_segment() followed by a crash could result in lost data?

On Thu, Jan 13, 2022 at 10:52 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> I think the ifndef should be outside the includes:

Thanks, fixed!

On Sun, Jan 16, 2022 at 3:26 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> Separate from this issue, I wonder if it'd be useful to write a DEBUG log
> showing when btree uses shared_buffers vs fsync. And a regression test which
> first SETs client_min_messages=debug to capture the debug log to demonstrate
> when/that new code path is being hit. I'm not sure if that would be good to
> merge, but it may be useful for now.

I will definitely think about doing this.

On Mon, Jan 17, 2022 at 12:22 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
> On Sun, Jan 16, 2022 at 02:25:59PM -0600, Justin Pryzby wrote:
> > On Thu, Jan 13, 2022 at 09:52:55AM -0600, Justin Pryzby wrote:
> > > This is failing on windows CI when I use initdb --data-checksums, as attached.
> > >
> > > https://cirrus-ci.com/task/5612464120266752
> > > https://api.cirrus-ci.com/v1/artifact/task/5612464120266752/regress_diffs/src/test/regress/regression.diffs
> > >
> > > +++ c:/cirrus/src/test/regress/results/bitmapops.out 2022-01-13 00:47:46.704621200 +0000
> > > ..
> > > +ERROR: could not read block 0 in file "base/16384/30310": read only 0 of 8192 bytes
> >
> > The failure isn't consistent, so I double checked my report. I have some more
> > details:
> >
> > The problem occurs maybe only ~25% of the time.
> >
> > The issue is in the 0001 patch.
> >
> > data-checksums isn't necessary to hit the issue.
> >
> > errlocation says: LOCATION: mdread, md.c:686 (the only place the error
> > exists)
> >
> > With Andres' windows crash patch, I obtained a backtrace - attached.
> > https://cirrus-ci.com/task/5978171861368832
> > https://api.cirrus-ci.com/v1/artifact/task/5978171861368832/crashlog/crashlog-postgres.exe_0fa8_2022-01-16_02-54-35-291.txt
> >
> > Maybe its a race condition or synchronization problem that nowhere else tends
> > to hit.
>
> I meant to say that I had not seen this issue anywhere but windows.
>
> But now, by chance, I still had the 0001 patch in my tree, and hit the same
> issue on linux:
>
> https://cirrus-ci.com/task/4550618281934848
> +++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tuplesort.out 2022-01-17 16:06:35.759108172 +0000
> EXPLAIN (COSTS OFF)
> SELECT id, noabort_increasing, noabort_decreasing FROM abbrev_abort_uuids ORDER BY noabort_increasing LIMIT 5;
> +ERROR: could not read block 0 in file "base/16387/t3_36794": read only 0 of 8192 bytes

Yes, I think this is due to the problem Andres mentioned with my saving
the SMgrRelation and then trying to use it after a relcache flush. The
new patch version addresses this by always re-executing
RelationGetSmgr() as recommended in the comments.

On Sun, Jan 23, 2022 at 4:55 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:
> > On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
> > <melanieplageman(at)gmail(dot)com> wrote:
> > Thus, the backend must ensure that
> > either the Redo pointer has not moved or that the data is fsync'd before
> > freeing the page.
>
> "freeing"?

Yes, I agree this wording was confusing/incorrect. I meant before it
moves on (I said freeing because it usually pfrees() the page in memory
that it was writing from). I've changed the commit message.

>
> > This is not a problem with pages written in shared buffers because the
> > checkpointer will block until all buffers that were dirtied before it
> > began finish before it moves the Redo pointer past their associated WAL
> > entries.
>
> > This commit makes two main changes:
> >
> > 1) It wraps smgrextend() and smgrwrite() in functions from a new API
> > for writing data outside of shared buffers, directmgr.
> >
> > 2) It saves the XLOG Redo pointer location before doing the write or
> > extend. It also adds an fsync request for the page to the
> > checkpointer's pending-ops table. Then, after doing the write or
> > extend, if the Redo pointer has moved (meaning a checkpoint has
> > started since it saved it last), then the backend fsync's the page
> > itself. Otherwise, it lets the checkpointer take care of fsync'ing
> > the page the next time it processes the pending-ops table.
>
> Why combine those two into one commit?

I've separated it into three commits -- the above two + a separate
commit that actually has the btree index use the self-fsync
optimization.

> > @@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
> > /* Now extend the file */
> > while (vm_nblocks_now < vm_nblocks)
> > {
> > - PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
> > -
> > - smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
> > + // TODO: aren't these pages empty? why checksum them
> > + unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);
>
> Yea, it's a bit odd. PageSetChecksumInplace() will just return immediately:
>
> /* If we don't need a checksum, just return */
> if (PageIsNew(page) || !DataChecksumsEnabled())
> return;
>
> OTOH, it seems easier to have it there than to later forget it, when
> e.g. adding some actual initial content to the pages during the smgrextend().

I've left these as is and removed the comment.

> > @@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
> >
> > wstate.heap = btspool->heap;
> > wstate.index = btspool->index;
> > + wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
> > + wstate.ub_wstate.redo = InvalidXLogRecPtr;
> > wstate.inskey = _bt_mkscankey(wstate.index, NULL);
> > /* _bt_mkscankey() won't set allequalimage without metapage */
> > wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
> > @@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> > if (!wstate->btws_zeropage)
> > wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
> > /* don't set checksum for all-zero page */
> > - smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> > - wstate->btws_pages_written++,
> > - (char *) wstate->btws_zeropage,
> > - true);
> > + unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
> > }
>
> There's a bunch of long lines in here...

Fixed.

> > - /*
> > - * When we WAL-logged index pages, we must nonetheless fsync index files.
> > - * Since we're building outside shared buffers, a CHECKPOINT occurring
> > - * during the build has no way to flush the previously written data to
> > - * disk (indeed it won't know the index even exists). A crash later on
> > - * would replay WAL from the checkpoint, therefore it wouldn't replay our
> > - * earlier WAL entries. If we do not fsync those pages here, they might
> > - * still not be on disk when the crash occurs.
> > - */
> > if (wstate->btws_use_wal)
> > - smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> > + unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
> > }
>
> The API of unbuffered_finish() only sometimes getting called, but
> unbuffered_prep() being unconditional, strikes me as prone to bugs. Perhaps
> it'd make sense to pass in whether the relation needs to be synced or not instead?

I've fixed this. Now unbuffered_prep() and unbuffered_finish() will
always be called. I've added a few options to unbuffered_prep() to
indicate whether or not the smgrimmedsync() should be called in the end
as well as whether or not skipFsync should be passed as true or false to
smgrextend() and smgrwrite() and whether or not the avoiding self-fsync
optimization should be used.

I found it best to do it this way because simply passing whether or not
to do the sync to unbuffered_finish() did not allow me to distinguish
between the case in which the sync should not be done ever (because the
caller did not call smgrimmedsync() or because the relation does not
require WAL) and when smgrimmedsync() should only be done if the redo
pointer has changed (in the case of the optimization).

I thought it actually made for a better API to specify up front (in
unbuffered_prep()) whether or not the caller should be prepared to do
the fsync itself or not and whether it not it wanted to do the
optimization. It feels less prone to error and omission.

> > spgbuildempty(Relation index)
> > {
> > Page page;
> > + UnBufferedWriteState wstate;
> > +
> > + wstate.smgr_rel = RelationGetSmgr(index);
> > + unbuffered_prep(&wstate);
>
> I don't think that's actually safe, and one of the instances could be the
> cause cause of the bug CI is seeing:
>
> * Note: since a relcache flush can cause the file handle to be closed again,
> * it's unwise to hold onto the pointer returned by this function for any
> * long period. Recommended practice is to just re-execute RelationGetSmgr
> * each time you need to access the SMgrRelation. It's quite cheap in
> * comparison to whatever an smgr function is going to do.
> */
> static inline SMgrRelation
> RelationGetSmgr(Relation rel)

Yes, I've changed this in the attached v5.

One question I have is whether or not other callers than btree index
could benefit from the self-fsync avoidance optimization.

Also, after taking another look at gist index build, I notice that
smgrimmedsync() is not done anywhere and skipFsync is always passed as
true, so what happens if a full checkpoint and a crash happens between
WAL-logging and whenever the dirty pages make it to permanent storage?

- Melanie

Attachment Content-Type Size
v5-0001-Add-unbuffered-IO-API.patch text/x-patch 31.7 KB
v5-0002-Avoid-immediate-fsync-for-unbuffered-IO.patch text/x-patch 12.2 KB
v5-0004-Use-shared-buffers-when-possible-for-index-build.patch text/x-patch 19.1 KB
v5-0003-BTree-index-use-unbuffered-IO-optimization.patch text/x-patch 2.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-02-09 19:27:26 Re: decoupling table and index vacuum
Previous Message Daniel Gustafsson 2022-02-09 18:37:13 Re: How to get started with contribution