Re: should crash recovery ignore checkpoint_flush_after ?

From: Andres Freund <andres(at)anarazel(dot)de>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Subject: Re: should crash recovery ignore checkpoint_flush_after ?
Date: 2020-01-18 23:22:00
Message-ID: 20200118232200.twuau3kw66ae3kkc@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-01-18 14:11:12 -0600, Justin Pryzby wrote:
> On Sat, Jan 18, 2020 at 10:48:22AM -0800, Andres Freund wrote:
> > On 2020-01-18 08:08:07 -0600, Justin Pryzby wrote:
> > > One of our PG12 instances was in crash recovery for an embarrassingly long time
> > > after hitting ENOSPC. (Note, I first started writing this mail 10 months ago
> > > while running PG11 after having the same experience after OOM). Running Linux.
> > >
> > > As I understand, the first thing that happens is syncing every file in the data
> > > dir, like in initdb --sync. These instances were both 5+TB on zfs, with
> > > compression, so that's slow, but tolerable, and at least understandable, and
> > > with visible progress in ps.
> > >
> > > The 2nd stage replays WAL. strace shows it's occasionally running
> > > sync_file_range, and I think recovery might've been several times faster if
> > > we'd just dumped the data at the OS ASAP, fsync once per file. In fact, I've
> > > just kill -9'd the recovery process and edited the config to disable this lest it
> > > spend all night in recovery.
> >
> > I'm not quite sure what you mean here with "fsync once per file". The
> > sync_file_range doesn't actually issue an fsync, even if it sounds like it.
>
> I mean if we didn't call sync_file_range() and instead let the kernel handle
> the writes and then fsync() at end of checkpoint, which happens in any
> case.

Yeah, but then more writes have to be done at the end, instead of in
parallel with other work during checkpointing. The kernel will often end
up starting to write back buffers before that - but without much concern
for locality, so it'll be a lot more random writes.
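
To make the difference concrete, here is a minimal sketch of the two calls
being discussed. It is not PostgreSQL's code: write_chunk_with_hint() and
finish_file() are hypothetical helper names, and the only real APIs used are
pwrite(), fsync() and the Linux-specific sync_file_range(). The hint asks the
kernel to start writing back one range and gives no durability guarantee,
which is why the sketch ignores its result and still relies on fsync() at the
end.

```c
#define _GNU_SOURCE             /* exposes sync_file_range() on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write one chunk at the given offset, then hint the
 * kernel to begin writeback for exactly that range.
 * Returns 0 if the write succeeded, -1 otherwise.
 */
int
write_chunk_with_hint(int fd, const void *buf, size_t len, off_t offset)
{
    assert(buf != NULL && len > 0);

    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
        return -1;

#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    /*
     * Advisory only: starts asynchronous writeback of the range. It is not
     * an fsync, so a failure here is ignored in this sketch.
     */
    (void) sync_file_range(fd, offset, (off_t) len, SYNC_FILE_RANGE_WRITE);
#endif

    return 0;
}

/* The actual durability point: one fsync() per file, at the end. */
int
finish_file(int fd)
{
    return fsync(fd);
}
```

Dropping the sync_file_range() call leaves the result equally durable after
finish_file(); what changes is when, and in what order, the kernel gets
around to the writes.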

> > > 4bc0f16 Change default of backend_flush_after GUC to 0 (disabled).
> >
> > FWIW, I still think this is the wrong default, and that it causes our
> > users harm.
>
> I have no opinion about the default, but the maximum seems low, as a maximum.
> Why not INT_MAX, like wal_writer_flush_after ?

Because it requires a static memory allocation, and that'd not be all
that trivial to change (we may be in a critical section, so can't
allocate). And issuing them in a larger batch will often stall within
the kernel, anyway - there's a limited number of writes the kernel can
have in progress at once. We could make it a PGC_POSTMASTER variable,
and allocate at server start, but that seems like a cure worse than the
disease.
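
As an illustration of the static-allocation point, here is a simplified
sketch of a fixed-capacity queue of pending flush requests. All names
(PendingQueue, MAX_PENDING, pending_add, and so on) are hypothetical and the
logic is deliberately reduced; it is not the server's implementation. The
array size is fixed at compile time, so queueing a request never allocates,
and the configurable limit can therefore never exceed that size.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 256         /* compile-time cap (value assumed for the sketch) */

typedef struct PendingQueue
{
    int         limit;          /* issue a batch at this many entries; 0 disables */
    int         npending;       /* entries currently queued */
    int         nissued;        /* batches issued so far (for demonstration) */
    unsigned    blocks[MAX_PENDING];    /* statically sized storage */
} PendingQueue;

void
pending_init(PendingQueue *q, int limit)
{
    /* The runtime setting is bounded by the static array size. */
    assert(limit >= 0 && limit <= MAX_PENDING);
    q->limit = limit;
    q->npending = 0;
    q->nissued = 0;
}

/* Stand-in for handing all queued requests to the kernel. */
void
pending_issue(PendingQueue *q)
{
    if (q->npending == 0)
        return;
    q->nissued++;
    q->npending = 0;
}

/* Queue one block. Never allocates, so it is usable where allocation is not. */
void
pending_add(PendingQueue *q, unsigned blockno)
{
    if (q->limit == 0)
        return;                 /* feature disabled */

    q->blocks[q->npending++] = blockno;
    if (q->npending >= q->limit)
        pending_issue(q);
}
```

Raising the maximum to INT_MAX in a design like this would mean either a huge
static array or allocating on demand, which is the part that is hard to do
from inside a critical section.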

wal_writer_flush_after doesn't have that concern, because it works
differently.

Greetings,

Andres Freund
