Re: Avoiding smgrimmedsync() during nbtree index builds

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2021-01-21 21:54:04
Message-ID: 812dbe94-5547-830b-714e-ad26906fb6f0@iki.fi
Lists: pgsql-hackers

On 21/01/2021 22:36, Andres Freund wrote:
> Hi,
>
> Every nbtree index build currently does an smgrimmedsync at the end:
>
> /*
>  * Read tuples in correct sort order from tuplesort, and load them into
>  * btree leaves.
>  */
> static void
> _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
> ...
> 	/*
> 	 * When we WAL-logged index pages, we must nonetheless fsync index files.
> 	 * Since we're building outside shared buffers, a CHECKPOINT occurring
> 	 * during the build has no way to flush the previously written data to
> 	 * disk (indeed it won't know the index even exists). A crash later on
> 	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
> 	 * earlier WAL entries. If we do not fsync those pages here, they might
> 	 * still not be on disk when the crash occurs.
> 	 */
> 	if (wstate->btws_use_wal)
> 	{
> 		RelationOpenSmgr(wstate->index);
> 		smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
> 	}
>
> In cases we create lots of small indexes, e.g. because of an initial
> schema load, partition creation or something like that, that turns out
> to be a major limiting factor (unless one turns fsync off).
>
>
> One way to address that would be to put newly built indexes into s_b
> (using a strategy, to avoid blowing out the whole cache), instead of
> using smgrwrite() etc directly. But that's a discussion with a bit more
> complex tradeoffs.
>
>
> What I wonder is why the issue addressed in the comment I copied above
> can't more efficiently be addressed using sync requests, like we do for
> other writes? It's possibly a bit more complicated than just passing
> skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?

Makes sense.
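
For concreteness, a rough sketch of what that could look like in
_bt_blwritepage() (untested and from memory, so the details may well
differ; today these calls pass skipFsync=true, and the smgrimmedsync()
at the end of _bt_load() would then be removed):

static void
_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
{
	...
	if (blkno == wstate->btws_pages_written)
	{
		/* extending the file: let md.c register a sync request */
		smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
				   (char *) page, false);	/* skipFsync = false */
		wstate->btws_pages_written++;
	}
	else
	{
		/* overwriting a block we zero-filled before */
		smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
				  (char *) page, false);	/* skipFsync = false */
	}
	...
}

With skipFsync=false, mdextend()/mdwrite() call register_dirty_segment()
and forward the fsync work to the checkpointer, the same way a normal
buffer write does.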

> A quick hack (probably not quite correct!) to evaluate the benefit shows
> that the attached script takes 2m17.223s with the smgrimmedsync and
> 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> the former case, CPU bound in the latter.
>
> Creating lots of tables with indexes (directly or indirectly through
> relations having a toast table) is pretty common, particularly after the
> introduction of partitioning.
>
>
> Thinking through the correctness of replacing smgrimmedsync() with sync
> requests, the potential problems that I can see are:
>
> 1) redo point falls between the log_newpage() and the
> write()/register_dirty_segment() in smgrextend/smgrwrite.
> 2) redo point falls between write() and register_dirty_segment()
>
> But both of these are fine in the context of initially filling a newly
> created relfilenode, as far as I can tell? Otherwise the current
> smgrimmedsync() approach wouldn't be safe either?

Hmm. If the redo point falls between the write() and the
register_dirty_segment(), and the checkpointer finishes the whole
checkpoint before register_dirty_segment() is called, you are not safe.
That can't happen with writes that go through the buffer manager, because
the checkpointer would block waiting for the in-progress flush of the
buffer to finish.
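
To spell out the interleaving I have in mind (just an illustration;
register_dirty_segment() is called from within mdwrite()/mdextend(),
right after the write()):

   backend:      log_newpage(...)           WAL record for the index page
   backend:      mdwrite() -> write()       data only in the kernel cache
   checkpointer: picks the redo point       after our WAL record
   checkpointer: absorbs and processes pending fsync requests,
                 completes the checkpoint
   backend:      register_dirty_segment()   too late, only handled by the
                                            *next* checkpoint
   <crash before the next checkpoint>

Recovery then starts at the completed checkpoint's redo point, so the
log_newpage() record is never replayed, and the write() may never have
reached disk.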

- Heikki
