Re: Avoiding smgrimmedsync() during nbtree index builds

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org, Peter Geoghegan <pg(at)bowt(dot)ie>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date: 2021-01-21 22:51:23
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> On 21/01/2021 22:36, Andres Freund wrote:
> > A quick hack (probably not quite correct!) to evaluate the benefit shows
> > that the attached script takes 2m17.223s with the smgrimmedsync and
> > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> > the former case, CPU bound in the latter.
> >
> > Creating lots of tables with indexes (directly or indirectly through
> > relations having a toast table) is pretty common, particularly after the
> > introduction of partitioning.
> >
> >
> > Thinking through the correctness of replacing smgrimmedsync() with sync
> > requests, the potential problems that I can see are:
> >
> > 1) redo point falls between the log_newpage() and the
> > write()/register_dirty_segment() in smgrextend/smgrwrite.
> > 2) redo point falls between write() and register_dirty_segment()
> >
> > But both of these are fine in the context of initially filling a newly
> > created relfilenode, as far as I can tell? Otherwise the current
> > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> Hmm. If the redo point falls between write() and the
> register_dirty_segment(), and the checkpointer finishes the whole checkpoint
> before register_dirty_segment(), you are not safe. That can't happen with
> write from the buffer manager, because the checkpointer would block waiting
> for the flush of the buffer to finish.

Hm, right.

The easiest way to address that race would be to just record the redo
pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
checkpoint started since the start of the index build.

Another approach would be to utilize PGPROC.delayChkpt, but I would
rather not unnecessarily expand the use of that.

It's kind of interesting - in my aio branch I moved the
register_dirty_segment() to before the actual asynchronous write (due to
availability of the necessary data), which ought to be safe because of
the buffer interlocking. But that doesn't work here, or for other places
doing writes without going through s_b. It'd be great if we could come
up with a general solution, but I don't immediately see anything great.

The best I can come up with is adding helper functions to wrap some of
the complexity for "unbuffered" writes of doing an immedsync iff the
redo pointer changed. Something very roughly like

typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
void unbuffered_prep(UnbufferedWriteState* state);
void unbuffered_write(UnbufferedWriteState* state, ...);
void unbuffered_extend(UnbufferedWriteState* state, ...);
void unbuffered_finish(UnbufferedWriteState* state);

which wouldn't just do the dance to avoid the immedsync() if possible,
but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
if we get that [1]).


Andres Freund


In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Sergey Shinderuk 2021-01-21 23:32:46 Re: [PATCH 1/1] Fix detection of pwritev support for OSX.
Previous Message Tom Lane 2021-01-21 22:32:56 Very misleading documentation for PQreset()