From: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Avoiding smgrimmedsync() during nbtree index builds
Every nbtree index build currently does an smgrimmedsync at the end:
/*
 * Read tuples in correct sort order from tuplesort, and load them into
 * btree leaves.
 */
static void
_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
{
...
	/*
	 * When we WAL-logged index pages, we must nonetheless fsync index files.
	 * Since we're building outside shared buffers, a CHECKPOINT occurring
	 * during the build has no way to flush the previously written data to
	 * disk (indeed it won't know the index even exists).  A crash later on
	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
	 * earlier WAL entries.  If we do not fsync those pages here, they might
	 * still not be on disk when the crash occurs.
	 */
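	/* (the call that comment guards, roughly, quoting from memory:) */
	if (wstate->btws_use_wal)
	{
		RelationOpenSmgr(wstate->index);
		smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
	}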
In cases where we create lots of small indexes, e.g. because of an initial
schema load, partition creation or something like that, that
smgrimmedsync() turns out to be a major limiting factor (unless one turns
fsync off).
One way to address that would be to put newly built indexes into s_b
(using a strategy, to avoid blowing out the whole cache), instead of
using smgrwrite() etc. directly. But that's a discussion with a bit more
complicated tradeoffs.
What I wonder is why the issue described in the comment I copied above
can't more efficiently be addressed using sync requests, like we do for
other writes? It's possibly a bit more complicated than just passing
skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?
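To illustrate, a rough (untested) sketch of what I mean in
_bt_blwritepage(): just flip the skipFsync argument so md.c registers a
sync request for each touched segment, and then drop the concluding
smgrimmedsync() in _bt_load():

	if (blkno == wstate->btws_pages_written)
	{
		/* extending the file... */
		smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
				   (char *) page, false);	/* was: true (skipFsync) */
		wstate->btws_pages_written++;
	}
	else
	{
		/* overwriting a block we zero-filled before */
		smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
				  (char *) page, false);	/* was: true (skipFsync) */
	}

Presumably the smgrextend() used for the zero-filled filler pages a few
lines earlier would need the same treatment.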
A quick hack (probably not quite correct!) to evaluate the benefit shows
that the attached script takes 2m17.223s with the smgrimmedsync and
0m22.870s when passing skipFsync=false to write/extend. It is entirely
IO bound in the former case, and CPU bound in the latter.
Creating lots of tables with indexes (directly or indirectly through
relations having a toast table) is pretty common, particularly after the
introduction of partitioning.
Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:
1) redo point falls between the log_newpage() and the
write()/register_dirty_segment() in smgrextend/smgrwrite.
2) redo point falls between write() and register_dirty_segment()

But both of these are fine in the context of initially filling a newly
created relfilenode, as far as I can tell? Otherwise the current
smgrimmedsync() approach wouldn't be safe either.
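For reference, the sync request machinery I'm referring to is md.c's
register_dirty_segment(), which forwards the fsync work to the
checkpointer and only falls back to a synchronous FileSync() when the
request queue is full (roughly):

static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
	FileTag		tag;

	INIT_MD_FILETAG(tag, reln->smgr_rnode.node, forknum, seg->mdfd_segno);

	/* Temp relations should never be fsync'd */
	Assert(!SmgrIsTemp(reln));

	/* Hand the segment off to the checkpointer's pending-fsync queue. */
	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
	{
		/* Queue full: do the fsync ourselves rather than losing it. */
		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
			ereport(data_sync_elevel(ERROR),
					(errmsg("could not fsync file \"%s\": %m",
							FilePathName(seg->mdfd_vfd))));
	}
}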